VARIATIONAL GRAM FUNCTIONS: CONVEX ANALYSIS AND OPTIMIZATION*

AMIN JALALI†, MARYAM FAZEL†, AND LIN XIAO§


Abstract. We propose a new class of convex penalty functions, called variational Gram functions
(VGFs), that can promote pairwise relations, such as orthogonality, among a set of vectors in
a vector space. These functions can serve as regularizers in convex optimization problems arising 
from hierarchical classification, multitask learning, and estimating vectors with disjoint supports, 
among other applications. We study convexity for VGFs, and give efficient characterizations for 
their convex conjugates, subdifferentials, and proximal operators. We discuss efficient optimization 
algorithms for regularized loss minimization problems where the loss admits a common, yet simple, 
variational representation and the regularizer is a VGF. These algorithms enjoy a simple kernel trick, 
an efficient line search, as well as computational advantages over first order methods based on the 
subdifferential or proximal maps. We also establish a general representer theorem for such learning 
problems. Lastly, numerical experiments on a hierarchical classification problem are presented to 
demonstrate the effectiveness of VGFs and the associated optimization algorithms. 

1. Introduction. Let $x_1, \dots, x_m$ be vectors in $\mathbb{R}^n$. It is well known that their
pairwise inner products $x_i^T x_j$, for $i,j = 1,\dots,m$, reveal essential information about
their relative orientations, and can serve as a measure for various properties such as
orthogonality. In this paper, we consider a class of functions that selectively aggregate
the pairwise inner products in a variational form,

(1)  $\Omega_{\mathcal{M}}(x_1,\dots,x_m) = \max_{M \in \mathcal{M}} \sum_{i,j=1}^m M_{ij}\, x_i^T x_j$,

where $\mathcal{M}$ is a compact subset of the set of $m \times m$ symmetric matrices. Let
$X = [x_1 \cdots x_m]$ be an $n \times m$ matrix. Then the pairwise inner products $x_i^T x_j$ are the
entries of the Gram matrix $X^T X$, and the function above can be written as

(2)  $\Omega_{\mathcal{M}}(X) = \max_{M \in \mathcal{M}} \langle X^T X, M \rangle = \max_{M \in \mathcal{M}} \operatorname{tr}(X M X^T)$,

where $\langle A, B \rangle = \operatorname{tr}(A^T B)$ denotes the matrix inner product. We call $\Omega_{\mathcal{M}}$ a variational
Gram function (VGF) of the vectors $x_1,\dots,x_m$ induced by the set $\mathcal{M}$. If the set $\mathcal{M}$
is clear from the context, we may write $\Omega(X)$ to simplify notation.

As an example, consider the case where $\mathcal{M}$ is given by a box constraint,

(3)  $\mathcal{M} = \{M : |M_{ij}| \le \bar{M}_{ij},\ i,j = 1,\dots,m\}$,

where $\bar{M}$ is a symmetric nonnegative matrix. In this case, the maximization in the
definition of $\Omega_{\mathcal{M}}$ picks either $M_{ij} = \bar{M}_{ij}$ or $M_{ij} = -\bar{M}_{ij}$ depending on the sign of
$x_i^T x_j$, for all $i,j = 1,\dots,m$ (if $x_i^T x_j = 0$, the choice is arbitrary). Therefore,

(4)  $\Omega_{\mathcal{M}}(X) = \max_{M \in \mathcal{M}} \sum_{i,j=1}^m M_{ij}\, x_i^T x_j = \sum_{i,j=1}^m \bar{M}_{ij}\, |x_i^T x_j|$.

Equivalently, $\Omega_{\mathcal{M}}(X)$ is the weighted sum of the absolute values of pairwise inner
products. This function was proposed in [44] as a regularization function to promote
orthogonality between selected pairs of linear classifiers in the context of hierarchical 
classification. 
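The closed form in (4) can be checked numerically against the variational definition. The sketch below is only an illustration: the helper names `vgf_box` and `vgf_box_bruteforce`, the matrix `M_bar`, and the random data are assumptions made here, not part of the original text. It compares the weighted sum of absolute inner products with a brute-force maximum over the symmetric sign-pattern corners of the box (3).

```python
import itertools
import numpy as np

def vgf_box(X, M_bar):
    """Closed form (4): weighted sum of absolute pairwise inner products."""
    gram = X.T @ X
    return float(np.sum(M_bar * np.abs(gram)))

def vgf_box_bruteforce(X, M_bar):
    """Maximize <X^T X, M> over corners M_ii = M_bar_ii, M_ij = +/- M_bar_ij."""
    m = M_bar.shape[0]
    gram = X.T @ X
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    best = -np.inf
    for signs in itertools.product([-1.0, 1.0], repeat=len(pairs)):
        S = np.eye(m)  # diagonal signs are +1 because x_i^T x_i >= 0
        for (i, j), s in zip(pairs, signs):
            S[i, j] = S[j, i] = s
        best = max(best, float(np.sum(S * M_bar * gram)))
    return best

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))          # n = 4, m = 3 (illustrative sizes)
M_bar = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.0]])
closed_form = vgf_box(X, M_bar)
brute_force = vgf_box_bruteforce(X, M_bar)
```

The two values agree because a linear function of $M$ attains its maximum over a box at a corner; the enumeration is exponential in the number of pairs and is meant only as a sanity check.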

*An earlier version of this work has appeared as Chapter 3 in [21].

Observe that the function $\operatorname{tr}(X M X^T)$ is a convex quadratic function of $X$ if $M$
is positive semidefinite. As a result, the variational form $\Omega_{\mathcal{M}}(X)$ is convex if $\mathcal{M}$ is a
subset of the positive semidefinite cone $\mathbb{S}^m_+$, because then it is the pointwise maximum
of a family of convex functions indexed by $M \in \mathcal{M}$ (see, e.g., [37, Theorem 5.5]).
However, this is not a necessary condition. For example, the set $\mathcal{M}$ in (3) is not a
subset of $\mathbb{S}^m_+$ unless $\bar{M} = 0$, but the VGF in (4) is convex provided that the comparison
matrix of $\bar{M}$ (derived by negating the off-diagonal entries) is positive semidefinite [44].
In this paper, we study conditions under which different classes of VGFs are convex
and provide unified characterizations for the subdifferential, convex conjugate, and
the associated proximal operator for any convex VGF. Interestingly, a convex VGF
defines a semi-norm¹ as

(5)  $\|X\|_{\mathcal{M}} := \sqrt{\Omega_{\mathcal{M}}(X)}$.

If $\mathcal{M} \subset \mathbb{S}^m_+$, then $\|X\|_{\mathcal{M}}$ is the pointwise maximum of the semi-norms $\|X M^{1/2}\|_F$ over
all $M \in \mathcal{M}$.

VGFs and the associated norms can serve as penalties or regularization functions 
in optimization problems to promote certain pairwise properties among a set of vector 
variables (such as orthogonality in the above example). In this paper, we consider 
optimization problems of the form 

(6)  $\operatorname{minimize}_X \ \ \mathcal{L}(X) + \lambda\, \Omega_{\mathcal{M}}(X)$,

where $\mathcal{L}(X)$ is a convex loss function of the variable $X = [x_1 \cdots x_m]$, $\Omega(X)$ is
a convex VGF, and $\lambda > 0$ is a parameter to trade off the relative importance of
these two functions. We will focus on problems where $\mathcal{L}(X)$ is smooth or has an
explicit variational structure, and show how to exploit the structure of $\mathcal{L}(X)$ and $\Omega(X)$
together to derive efficient optimization algorithms. More specifically, we employ a
unified variational representation for many common loss functions, as

(7)  $\mathcal{L}(X) = \max_{g \in \mathcal{G}} \ \langle X, \mathcal{D}(g) \rangle - \hat{\mathcal{L}}(g)$,

where $\hat{\mathcal{L}} : \mathbb{R}^p \to \mathbb{R}$ is a convex function, $\mathcal{G}$ is a convex and compact subset of $\mathbb{R}^p$, and
$\mathcal{D} : \mathbb{R}^p \to \mathbb{R}^{n \times m}$ is a linear operator. Exploiting the variational structure in both the
loss function and the regularizer allows us to employ efficient primal-dual algorithms,
such as mirror-prox [35], which now only require projections onto $\mathcal{M}$ and $\mathcal{G}$, instead
of computing subgradients or proximal mappings for the loss and the regularizer.

Unfolding the structure of the loss functions and regularizers as above allows us to
provide a simple preprocessing step for dimensionality reduction, presented in Section 
5.2, which can substantially reduce the per iteration cost of any optimization algorithm 
for (6). As another byproduct of these structures, we also present a general representer 
theorem for problems of the form (6) in Section 5.3 where the optimal solution is 
characterized in terms of the input data in a simple and interpretable way. 

Organization. In Section 2, we give more examples of VGFs and explain the 
connections with functions of Euclidean distance matrices and robust optimization. 
Section 3 studies the convexity of VGFs, as well as their conjugates, semidefinite 
representability, corresponding norms, and subdifferentials. Their proximal operators
are derived in Section 4. In Section 5, we study a class of structured loss minimization
problems with VGF penalties, and show how to exploit their structure: to obtain
an efficient optimization algorithm using a variant of the mirror-prox algorithm with
adaptive line search, to use a simple preprocessing step to reduce the computations in
each iteration, and to characterize the optimal solution via a representer
theorem. Finally, in Section 6, we present a numerical experiment on hierarchical
classification to illustrate the application of VGFs.

¹A semi-norm satisfies all the properties of a norm except that it can be zero for a nonzero input.

Notation. In this paper, $\mathbb{S}^m$ denotes the set of symmetric matrices in $\mathbb{R}^{m \times m}$,
and $\mathbb{S}^m_+ \subset \mathbb{S}^m$ is the cone of positive semidefinite (PSD) matrices. We may omit
the superscript $m$ when the dimension is clear from the context. The symbol $\preceq$
represents the Loewner partial order and $\langle \cdot, \cdot \rangle$ denotes the inner product. We use
capital letters for matrices and bold lower case letters for vectors. We use $X \in \mathbb{R}^{n \times m}$
and $\mathbf{x} = \operatorname{vec}(X) \in \mathbb{R}^{nm}$ interchangeably, with $x_i$ denoting the $i$th column of $X$;
i.e., $X = [x_1 \cdots x_m]$. $\mathbf{1}$ and $\mathbf{0}$ denote matrices or vectors of all ones and all zeros,
respectively, whose sizes are clear from the context. The entry-wise absolute
value of $X$ is denoted by $|X|$. $\|\cdot\|_p$ denotes the $\ell_p$ norm of the input vector or
matrix, and $\|\cdot\|_F$ denotes the Frobenius norm (similar to the $\ell_2$ vector norm). The
convex conjugate of a function $f$ is defined as $f^*(y) = \sup_x \ \langle x, y \rangle - f(x)$, and the
dual norm of $\|\cdot\|$ is defined as $\|y\|^* = \sup\{\langle x, y \rangle : \|x\| \le 1\}$. $\arg\min$ ($\arg\max$)
returns an optimal point to a minimization (maximization) program while $\operatorname{Arg\,min}$
(or $\operatorname{Arg\,max}$) is the set of all optimal points. The operator $\operatorname{diag}(\cdot)$ is used to put a
vector on the diagonal of a zero matrix of corresponding size, to extract the diagonal
entries of a matrix as a vector, or for zeroing out the off-diagonal entries of a matrix.
We use $f \equiv g$ to denote $f(x) = g(x)$ for all $x \in \operatorname{dom}(f) = \operatorname{dom}(g)$.

2. Examples and connections. In this section, we present examples of VGFs 
associated to different choices of the set $\mathcal{M}$. The list includes some well known functions
that can be expressed in the variational form of (1), as well as some new ones.

Vector norms. Any vector norm $\|\cdot\|$ on $\mathbb{R}^m$ is the square root of a VGF defined
by $\mathcal{M} = \{u u^T : \|u\|^* \le 1\}$. For a column vector $x \in \mathbb{R}^m$, the VGF is given by

$\Omega_{\mathcal{M}}(x^T) = \max_u \{\operatorname{tr}(x^T u u^T x) : \|u\|^* \le 1\} = \max_u \{(x^T u)^2 : \|u\|^* \le 1\} = \|x\|^2$.
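As a concrete instance of this identity, take $\|\cdot\|$ to be the $\ell_1$ norm, whose dual norm is $\ell_\infty$. The sketch below assumes this particular choice and uses a hypothetical helper name; it evaluates the maximum of $(x^T u)^2$ over the vertices of the $\ell_\infty$ unit ball, which suffices because a convex function attains its maximum over a polytope at a vertex.

```python
import itertools
import numpy as np

def vgf_from_dual_ball_vertices(x):
    """Maximum of (x^T u)^2 over the vertices u in {-1, +1}^m of the l_inf unit ball."""
    m = x.shape[0]
    return max(float(np.dot(x, np.array(u))) ** 2
               for u in itertools.product([-1.0, 1.0], repeat=m))

x = np.array([1.5, -2.0, 0.5, 3.0])
vgf_value = vgf_from_dual_ball_vertices(x)
l1_norm_squared = float(np.sum(np.abs(x))) ** 2   # (1.5 + 2 + 0.5 + 3)^2 = 49
```

The maximizing vertex is the sign vector of `x`, so the VGF value equals the squared $\ell_1$ norm.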

As another example for when $n = 1$, consider the case where $\mathcal{M}$ is a compact
convex set of diagonal matrices with positive diagonal entries. The corresponding
VGF (and norm) is defined as

(8)  $\Omega_{\mathcal{M}}(x^T) = \max_{\theta \in \operatorname{diag}(\mathcal{M})} \sum_{i=1}^m \theta_i x_i^2 = \|x\|_{\mathcal{M}}^2$,

and the dual norm can be expressed as $(\|x\|_{\mathcal{M}}^*)^2 = \inf_{\theta \in \operatorname{diag}(\mathcal{M})} \sum_{i=1}^m x_i^2 / \theta_i$. This norm
and its dual were first introduced in [32], in the context of regularization for structured
sparsity, and later discussed in [3]. The $k$-support norm [2], which is a norm used
to encourage vectors to have $k$ or fewer nonzero entries, is a special case of the dual
norm given above, corresponding to $\mathcal{M} = \{\operatorname{diag}(\theta) : 0 \le \theta_i \le 1,\ \mathbf{1}^T \theta = k\}$.

Norms of the Gram matrix. Given a symmetric nonnegative matrix $\bar{M}$, we can
define a class of VGFs based on any norm $\|\cdot\|$ and its dual norm $\|\cdot\|^*$. Consider

(9)  $\mathcal{M} = \{K \circ \bar{M} : \|K\|^* \le 1,\ K^T = K\}$,

where $\circ$ denotes the matrix Hadamard product, $(K \circ \bar{M})_{ij} = K_{ij} \bar{M}_{ij}$ for all $i,j$. Then,

$\Omega_{\mathcal{M}}(X) = \max_{\|K\|^* \le 1} \langle K \circ \bar{M}, X^T X \rangle = \max_{\|K\|^* \le 1} \langle K, \bar{M} \circ (X^T X) \rangle = \|\bar{M} \circ (X^T X)\|$.

The following are several concrete examples.

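(i) If we let $\|\cdot\|^*$ in (9) be the $\ell_\infty$ norm, then $\mathcal{M} = \{M : |M_{ij}/\bar{M}_{ij}| \le 1,\ i,j = 1,\dots,m\}$,
which is the same as in (3). Here we use the convention $0/0 = 0$, thus
$M_{ij} = 0$ whenever $\bar{M}_{ij} = 0$. In this case, we obtain the VGF in (4):

$\Omega_{\mathcal{M}}(X) = \|\bar{M} \circ (X^T X)\|_1 = \sum_{i,j=1}^m \bar{M}_{ij} |x_i^T x_j|$.

(ii) If we use the $\ell_2$ norm in (9), then $\mathcal{M} = \{M : \sum_{i,j=1}^m (M_{ij}/\bar{M}_{ij})^2 \le 1\}$ and

(10)  $\Omega_{\mathcal{M}}(X) = \|\bar{M} \circ (X^T X)\|_F = \big( \sum_{i,j=1}^m (\bar{M}_{ij}\, x_i^T x_j)^2 \big)^{1/2}$.

This function has been considered in multi-task learning [40], and also in the context
of super-saturated designs [8, 13].

(iii) Using the $\ell_1$ norm in (9) gives $\mathcal{M} = \{M : \sum_{i,j=1}^m |M_{ij}/\bar{M}_{ij}| \le 1\}$ and

(11)  $\Omega_{\mathcal{M}}(X) = \|\bar{M} \circ (X^T X)\|_\infty = \max_{i,j=1,\dots,m} \bar{M}_{ij} |x_i^T x_j|$.

This case can also be traced back to [8] in the statistics literature, where the maximum
of $|x_i^T x_j|$ for $i \ne j$ is used as the measure to choose among supersaturated designs.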
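The three examples above differ only in which entrywise norm is applied to $\bar{M} \circ (X^T X)$. A minimal sketch, with a hypothetical helper name `gram_norm_vgfs` and a small hand-picked `X` and `M_bar` chosen here for illustration:

```python
import numpy as np

def gram_norm_vgfs(X, M_bar):
    """Entrywise l1, Frobenius, and l_inf norms of M_bar o (X^T X)."""
    weighted = M_bar * (X.T @ X)   # Hadamard product with the Gram matrix
    return {
        "l1": float(np.sum(np.abs(weighted))),          # VGF (4)
        "fro": float(np.linalg.norm(weighted, "fro")),  # VGF (10)
        "linf": float(np.max(np.abs(weighted))),        # VGF (11)
    }

# Columns x_1 = (1, 0), x_2 = (0, 2), x_3 = (1, 1); all pairs weighted equally.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 1.0]])
M_bar = np.ones((3, 3))
values = gram_norm_vgfs(X, M_bar)
```

For this data the Gram matrix is `[[1, 0, 1], [0, 4, 2], [1, 2, 2]]`, so the three values are 13, the square root of 31, and 4, respectively.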

Many other interesting examples can be constructed this way. For example, one
can model sharing versus competition using the group-$\ell_1$ norm of the Gram matrix, which was
considered in vision tasks [22]. We will revisit the above examples to discuss their
convexity conditions in Section 3. 

Spectral functions. From the definition, the value of a VGF is invariant under
left-multiplication of $X$ by an orthogonal matrix, but this is not true for right
multiplication. Hence, VGFs are not functions of singular values (e.g., see [27]) in general,
and are functions of the row space of $X$ as well. This also implies that in general
$\Omega(X) \ne \Omega(X^T)$. However, if the set $\mathcal{M}$ is closed under left and right multiplication
by orthogonal matrices, then $\Omega_{\mathcal{M}}(X)$ becomes a function of squared singular values
of $X$. For any matrix $M \in \mathbb{S}^m$, denote the sorted vector of its singular values by
$\sigma(M)$ and let $\Theta = \{\sigma(M) : M \in \mathcal{M}\}$. Then we have

(12)  $\Omega_{\mathcal{M}}(X) = \max_{M \in \mathcal{M}} \operatorname{tr}(X M X^T) = \max_{\theta \in \Theta} \sum_{i=1}^{\min(n,m)} \theta_i\, \sigma_i(X)^2$,

as a result of von Neumann's trace inequality [33]. Note the similarity of the above
to the VGF in (8). As an example, consider

(13)  $\mathcal{M} = \{M : \alpha_1 I \preceq M \preceq \alpha_2 I,\ \operatorname{tr}(M) = \alpha_3\}$,

where $0 < \alpha_1 < \alpha_2$ and $\alpha_3 \in [m\alpha_1, m\alpha_2]$ are given constants. The so-called spectral
box-norm [31] is the dual to the norm of the form (5) defined via this $\mathcal{M}$. Note that
in this case, $\mathcal{M} \subset \mathbb{S}^m_+$, so it is easy to see that $\Omega_{\mathcal{M}}$ is convex. The square of this norm
has been considered in [20] for clustered multitask learning where it is presented as a
convex relaxation for $k$-means.
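The invariance under left multiplication, and its failure under right multiplication, can be seen on a tiny example. The sketch below uses the box VGF (4) with an illustrative weight matrix `M_bar` that penalizes only the first column; the helper name and all data are assumptions made here for illustration.

```python
import numpy as np

def vgf_box(X, M_bar):
    """VGF (4): sum_ij M_bar_ij * |x_i^T x_j| for the columns x_i of X."""
    return float(np.sum(M_bar * np.abs(X.T @ X)))

M_bar = np.array([[1.0, 0.0],
                  [0.0, 0.0]])     # only the squared norm of column 1 is penalized
X = np.array([[1.0, 0.0],
              [0.0, 0.0]])

theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthogonal (rotation)
Q = np.array([[0.0, 1.0],
              [1.0, 0.0]])                        # orthogonal (column swap)

value = vgf_box(X, M_bar)             # Gram matrix is diag(1, 0)
left_value = vgf_box(U @ X, M_bar)    # Gram matrix is unchanged by U
right_value = vgf_box(X @ Q, M_bar)   # columns are swapped, so the value changes
```

Left multiplication leaves $X^T X$ unchanged, while right multiplication by `Q` moves the nonzero column to a position with zero weight.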

Finite set $\mathcal{M}$. For a finite set $\mathcal{M} = \{M_1, \dots, M_p\} \subset \mathbb{S}^m_+$, the VGF is given by

$\Omega_{\mathcal{M}}(X) = \max_{i=1,\dots,p} \|X M_i^{1/2}\|_F^2$,

i.e., the pointwise maximum of a finite number of squared weighted Frobenius norms.

In the following subsections, we consider classes of VGFs which can be used 
in promoting diversity, have connections to Euclidean distance matrices, or can be 
interpreted under a robust optimization framework.

2.1. Diversification. Certain VGFs can be used for diversifying certain pairs
of columns of the input matrix; e.g., minimizing (4) pushes to zero the inner products
$x_i^T x_j$ corresponding to the nonzero entries in $\bar{M}$ as much as possible. As another
example, observe that two nonnegative vectors have disjoint supports if and
only if they are orthogonal to each other. Hence, using a VGF as in (4),
$\Omega_{\mathcal{M}}(X) = \sum_{i,j=1}^m \bar{M}_{ij} |x_i^T x_j|$, that promotes orthogonality, we can define

(14)  $\Psi(X) = \Omega_{\mathcal{M}}(|X|)$

to promote disjoint supports among certain columns of $X$, hence diversifying the
supports of columns of $X$. Convexity of (14) is discussed in Section 3.6. Different
approaches have been used in machine learning applications for promoting diversity;
e.g., see [29, 26, 19] and references therein.

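2.2. Functions of Euclidean distance matrix. Consider a set $\mathcal{M} \subset \mathbb{S}^m$ with
the property that $M \mathbf{1} = \mathbf{0}$ for all $M \in \mathcal{M}$. For every $M \in \mathcal{M}$, let $A = \operatorname{diag}(M) - M$
and observe that

$\operatorname{tr}(X M X^T) = \sum_{i,j=1}^m M_{ij}\, x_i^T x_j = \tfrac{1}{2} \sum_{i,j=1}^m A_{ij} \|x_i - x_j\|_2^2$.

This allows us to express the associated VGF as a function of the Euclidean distance
matrix $D$, which is defined by $D_{ij} = \tfrac{1}{2} \|x_i - x_j\|_2^2$ for $i,j = 1,\dots,m$ (see, e.g., [9,
Section 8.3]). Let $\mathcal{A} = \{\operatorname{diag}(M) - M : M \in \mathcal{M}\}$. Then we have

$\Omega_{\mathcal{M}}(X) = \max_{M \in \mathcal{M}} \operatorname{tr}(X M X^T) = \max_{A \in \mathcal{A}} \langle A, D \rangle$.

A sufficient condition for the above function to be convex in $X$ is that each $A \in \mathcal{A}$
is entrywise nonnegative, which implies that the corresponding $M = \operatorname{diag}(A\mathbf{1}) - A$ is
diagonally dominant with nonnegative diagonal elements, hence positive semidefinite.
However, this is not a necessary condition, and $\Omega_{\mathcal{M}}$ can be convex without all $A$'s
being entrywise nonnegative.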
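The identity between the quadratic form and the distance-matrix form can be verified numerically for a single pair $(M, A)$. In the sketch below the random nonnegative `A` and the problem sizes are illustrative assumptions; `M` is built as $\operatorname{diag}(A\mathbf{1}) - A$ so that $M\mathbf{1} = \mathbf{0}$ holds by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4
X = rng.standard_normal((n, m))

# Symmetric, entrywise nonnegative A with zero diagonal (an illustrative choice).
A = rng.uniform(0.0, 1.0, size=(m, m))
A = np.triu(A, k=1)
A = A + A.T

M = np.diag(A.sum(axis=1)) - A          # M = diag(A 1) - A, hence M 1 = 0

# Euclidean distance matrix with the 1/2 scaling used in the text.
diff = X[:, :, None] - X[:, None, :]    # diff[:, i, j] = x_i - x_j
D = 0.5 * np.sum(diff ** 2, axis=0)

quadratic_form = float(np.trace(X @ M @ X.T))
distance_form = float(np.sum(A * D))    # <A, D>
min_eig_M = float(np.linalg.eigvalsh(M).min())
```

Since `A` is entrywise nonnegative, `M` is diagonally dominant with nonnegative diagonal, and its smallest eigenvalue is zero up to rounding (the all-ones vector is in its null space).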

2.3. Connection with robust optimization. The VGF-regularized loss minimization
problem has the following connection to robust optimization (see, e.g., [7]):
the optimization program

$\operatorname{minimize}_X \ \max_{M \in \mathcal{M}} \ \mathcal{L}(X) + \operatorname{tr}(X M X^T)$

can be interpreted as seeking an $X$ with minimal worst-case value over an uncertainty
set $\mathcal{M}$. Alternatively, when $\mathcal{M} \subset \mathbb{S}^m_+$, this can be viewed as a problem with
Tikhonov regularization $\|X M^{1/2}\|_F^2$ where the weight matrix $M^{1/2}$ is subject to errors
characterized by the set $\mathcal{M}$.

We close this section by pointing out that VGFs are different from Quadratic
Support Functions introduced in [1]. A closer notion to a VGF is the support functionals
studied independently in [10] which correspond to VGFs associated to affine
sets $\mathcal{M}$, while also allowing for inhomogeneous quadratic functions in their definition.

3. Convex analysis of VGF. In this section, we study the convexity of VGFs,
their conjugate functions and subdifferentials, as well as the related norms.

First, we review some basic properties. Notice that $\Omega_{\mathcal{M}}$ is the support function
of the set $\mathcal{M}$ at the Gram matrix $X^T X$; i.e.,

(15)  $\Omega_{\mathcal{M}}(X) = \max_{M \in \mathcal{M}} \operatorname{tr}(X M X^T) = S_{\mathcal{M}}(X^T X)$,

where the support function of a set $\mathcal{M}$ is defined as $S_{\mathcal{M}}(Y) = \sup_{M \in \mathcal{M}} \langle M, Y \rangle$ (see,
e.g., [37, Section 13]). By properties of the support function (see [37, Section 15]),

$\Omega_{\mathcal{M}} \equiv \Omega_{\operatorname{conv}(\mathcal{M})}$,

where $\operatorname{conv}(\mathcal{M})$ denotes the convex hull of $\mathcal{M}$. It is clear that the representation of a
VGF (i.e., the associated set $\mathcal{M}$) is not unique. Henceforth, without loss of generality
we assume $\mathcal{M}$ is convex unless explicitly noted otherwise. Also, for simplicity we
assume $\mathcal{M}$ is a compact set, while all we need is that the maximum in (1) is attained.
For example, a non-compact $\mathcal{M}$ that is unbounded along any negative semidefinite
direction is allowed. Lastly, we assume $0 \in \mathcal{M}$.

Moreover, VGFs are left unitarily invariant; for any $Y \in \mathbb{R}^{n \times m}$ and any orthogonal
matrix $U \in \mathbb{R}^{n \times n}$, where $U U^T = U^T U = I$, we have $\Omega(Y) = \Omega(UY)$ and
$\Omega^*(Y) = \Omega^*(UY)$; use (2) and (19). We use this property in simplifying computations
involving VGFs (such as proximal mapping calculations in Section 4) as well as
in establishing a general kernel trick and representer theorem in Section 5.2.

As we mentioned in the introduction, a sufficient condition for the convexity of
a VGF is that $\mathcal{M} \subset \mathbb{S}^m_+$. In Section 3.1, we discuss more concrete conditions for
determining convexity when the set $\mathcal{M}$ is a polytope. In Section 3.2, we describe a
more tangible sufficient condition for general sets.

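3.1. Convexity with polytope $\mathcal{M}$. Consider the case where $\mathcal{M}$ is a polytope
with $p$ vertices, i.e., $\mathcal{M} = \operatorname{conv}\{M_1, \dots, M_p\}$. The support function of this set is
given as $S_{\mathcal{M}}(Y) = \max_{i=1,\dots,p} \langle Y, M_i \rangle$ and is piecewise linear [39, Section 8.E]. For a
polytope $\mathcal{M}$, we define $\mathcal{M}_{\mathrm{eff}}$ as a subset of $\{M_1, \dots, M_p\}$ with the smallest possible
size satisfying $S_{\mathcal{M}}(X^T X) = S_{\mathcal{M}_{\mathrm{eff}}}(X^T X)$ for all $X \in \mathbb{R}^{n \times m}$.

As an example, for $\mathcal{M} = \{M : |M_{ij}| \le \bar{M}_{ij},\ i,j = 1,\dots,m\}$, which gives the
function defined in (4), we have

(16)  $\mathcal{M}_{\mathrm{eff}} \subseteq \{M : M_{ii} = \bar{M}_{ii},\ M_{ij} = \pm \bar{M}_{ij} \ \text{for} \ i \ne j\}$.

Whether the above inclusion holds with equality or not depends on $n$.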

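Theorem 1. For a polytope $\mathcal{M} \subset \mathbb{S}^m$, the associated VGF is convex if and only
if $\mathcal{M}_{\mathrm{eff}} \subset \mathbb{S}^m_+$.

Proof. Obviously, $\mathcal{M}_{\mathrm{eff}} \subset \mathbb{S}^m_+$ ensures convexity of $\max_{M \in \mathcal{M}_{\mathrm{eff}}} \operatorname{tr}(X M X^T) = \Omega_{\mathcal{M}}(X)$.
Next, we prove necessity of this condition for any $\mathcal{M}_{\mathrm{eff}}$. Take any $M_i \in \mathcal{M}_{\mathrm{eff}}$.
If for every $X \in \mathbb{R}^{n \times m}$ with $\Omega(X) = \operatorname{tr}(X M_i X^T)$ there exists another $M_j \in \mathcal{M}_{\mathrm{eff}}$ with
$\Omega(X) = \operatorname{tr}(X M_j X^T)$, then $\mathcal{M}_{\mathrm{eff}} \setminus \{M_i\}$ is an effective subset of $\mathcal{M}$, which contradicts
the minimality of $\mathcal{M}_{\mathrm{eff}}$. Hence, there exists $X_i$ such that $\Omega(X_i) = \operatorname{tr}(X_i M_i X_i^T) >
\operatorname{tr}(X_i M_j X_i^T)$ for all $j \ne i$. Hence, $\Omega$ is twice continuously differentiable in a small
neighborhood of $X_i$ with Hessian $\nabla^2 \Omega(\operatorname{vec}(X_i)) = 2 M_i \otimes I_n$, where $\otimes$ denotes the
matrix Kronecker product. Since $\Omega$ is assumed to be convex, the Hessian has to be
PSD, which gives $M_i \succeq 0$. □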

Next we give a few examples to illustrate the use of Theorem 1.

(i) We begin with the example defined in (4). The authors of [44] provided a
condition for convexity that is sufficient, and also necessary when $n \ge m-1$, using results from
M-matrix theory. First, define the comparison matrix $\tilde{M}$ associated to the nonnegative
matrix $\bar{M}$ as $\tilde{M}_{ii} = \bar{M}_{ii}$ and $\tilde{M}_{ij} = -\bar{M}_{ij}$ for $i \ne j$. Then $\Omega_{\mathcal{M}}$ is convex if $\tilde{M}$
is positive semidefinite, and this condition is also necessary when $n \ge m-1$ [44].
Theorem 1 provides an alternative and more general proof. Denote the minimum
eigenvalue of a symmetric matrix $M$ by $\lambda_{\min}(M)$. From (16) we have

(17)  $\min_{M \in \mathcal{M}_{\mathrm{eff}}} \lambda_{\min}(M) = \min_{M \in \mathcal{M}_{\mathrm{eff}}} \min_{\|z\|_2 = 1} z^T M z \ \ge\ \min_{\|z\|_2 = 1} \ \sum_i \bar{M}_{ii} z_i^2 - \sum_{i \ne j} \bar{M}_{ij} |z_i z_j| \ =\ \min_{\|z\|_2 = 1} |z|^T \tilde{M} |z| \ \ge\ \lambda_{\min}(\tilde{M})$.

When $n \ge m-1$, one can construct $X \in \mathbb{R}^{n \times m}$ such that all off-diagonal entries of
$X^T X$ are negative (see the example in Appendix A.2 of [44]). On the other hand,
Lemma 2.1(2) of [12] states that the existence of such a matrix implies $n \ge m-1$.
Hence, $\tilde{M} \in \mathcal{M}_{\mathrm{eff}}$ if and only if $n \ge m-1$. Therefore, both inequalities in (17) should
hold with equality, which means that $\mathcal{M}_{\mathrm{eff}} \subset \mathbb{S}^m_+$ if and only if $\tilde{M} \succeq 0$. By Theorem
1, this is equivalent to the VGF in (4) being convex. If $n < m-1$, then $\mathcal{M}_{\mathrm{eff}}$ may not
contain $\tilde{M}$, thus $\tilde{M} \succeq 0$ is only a sufficient condition for convexity for general $n$.
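The chain of inequalities in (17) says that, among all sign patterns allowed by (16), the comparison matrix has the smallest minimum eigenvalue. A brute-force sketch of this check for a small example follows; the helper names and the matrix `M_bar` are illustrative assumptions, and the enumeration covers the full right-hand side of (16), not necessarily $\mathcal{M}_{\mathrm{eff}}$ itself.

```python
import itertools
import numpy as np

def comparison_matrix(M_bar):
    """Keep the diagonal of M_bar and negate its off-diagonal entries."""
    return 2.0 * np.diag(np.diag(M_bar)) - M_bar

def min_eig_over_sign_patterns(M_bar):
    """Smallest eigenvalue over symmetric M with M_ii = M_bar_ii, M_ij = +/- M_bar_ij."""
    m = M_bar.shape[0]
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    worst = np.inf
    for signs in itertools.product([-1.0, 1.0], repeat=len(pairs)):
        M = np.diag(np.diag(M_bar)).astype(float)
        for (i, j), s in zip(pairs, signs):
            M[i, j] = M[j, i] = s * M_bar[i, j]
        worst = min(worst, float(np.linalg.eigvalsh(M).min()))
    return worst

M_bar = np.array([[2.0, 1.0, 0.5],
                  [1.0, 2.0, 1.0],
                  [0.5, 1.0, 2.0]])
lam_comparison = float(np.linalg.eigvalsh(comparison_matrix(M_bar)).min())
lam_worst_pattern = min_eig_over_sign_patterns(M_bar)
```

For this `M_bar` the comparison matrix is positive definite, so every sign pattern is PSD and the VGF (4) is convex for any $n$.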



Fig. 1: The positive semidefinite cone, and the set in (3) defined by $\bar{M} =
[1, 0.8;\ 0.8, 1]$, where $2 \times 2$ symmetric matrices are embedded into $\mathbb{R}^3$. The thick edge
of the cube is the set of all points with the same diagonal elements as $\bar{M}$ (see (16)),
and the two endpoints constitute $\mathcal{M}_{\mathrm{eff}}$. Positive semidefiniteness of $\tilde{M}$ is a necessary
and sufficient condition for the convexity of $\Omega_{\mathcal{M}} : \mathbb{R}^{n \times 2} \to \mathbb{R}$ for all $n \ge m-1 = 1$.


(ii) Similar to the set $\mathcal{M}$ above, consider a box that is not necessarily symmetric
around the origin. More specifically, let $\mathcal{M} = \{M \in \mathbb{S}^m : M_{ii} = D_{ii},\ |M - C| \le D\}$,
where $C$ (denoting the center) is a symmetric matrix with zero diagonal, and $D$ is a
symmetric nonnegative matrix. In this case, we have $\mathcal{M}_{\mathrm{eff}} \subseteq \{M : M_{ii} = D_{ii},\ M_{ij} =
C_{ij} \pm D_{ij} \ \text{for} \ i \ne j\}$. When used as a penalty function in applications, this can capture
the prior information that when $x_i^T x_j$ is not zero, a particular range of acute or obtuse
angles (depending on the sign of $C_{ij}$) between the vectors is preferred. Similar to (17),

$\min_{M \in \mathcal{M}_{\mathrm{eff}}} \lambda_{\min}(M) \ \ge\ \min_{\|z\|_2 = 1} \ |z|^T \tilde{D} |z| + z^T C z \ \ge\ \lambda_{\min}(\tilde{D}) + \lambda_{\min}(C)$,

where $\tilde{D}$ is the comparison matrix associated to $D$. Note that $C$ has zero diagonal
and cannot be PSD unless $C = 0$. Hence, a sufficient condition for convexity of $\Omega_{\mathcal{M}}$ defined by an
asymmetric box is that $\lambda_{\min}(\tilde{D}) + \lambda_{\min}(C) \ge 0$.

(iii) Consider the VGF defined in (11), whose associated variational set is

(18)  $\mathcal{M} = \{M \in \mathbb{S}^m : \sum_{(i,j):\, \bar{M}_{ij} \ne 0} |M_{ij}/\bar{M}_{ij}| \le 1,\ M_{ij} = 0 \ \text{if} \ \bar{M}_{ij} = 0\}$,

where $\bar{M}$ is a symmetric nonnegative matrix. Vertices of $\mathcal{M}$ are matrices with either
only one nonzero value $\bar{M}_{ii}$ on the diagonal, or two nonzero off-diagonal entries
at $(i,j)$ and $(j,i)$ equal to $\tfrac{1}{2}\bar{M}_{ij}$ or $-\tfrac{1}{2}\bar{M}_{ij}$. The second type of matrices
cannot be PSD as their diagonal is zero, and according to Theorem 1, convexity
of $\Omega_{\mathcal{M}}$ requires that these vertices do not belong to $\mathcal{M}_{\mathrm{eff}}$. Therefore, the matrices in $\mathcal{M}_{\mathrm{eff}}$
should be diagonal. Hence, a convex VGF corresponding to the set (18) has the
form $\Omega(X) = \max_{i=1,\dots,m} \bar{M}_{ii} \|x_i\|_2^2$. To ensure such a description for $\mathcal{M}_{\mathrm{eff}}$ we need
$\max\{\bar{M}_{ii}\|x_i\|_2^2,\ \bar{M}_{jj}\|x_j\|_2^2\} \ge \bar{M}_{ij} |x_i^T x_j|$ for all $i,j$ and any $X \in \mathbb{R}^{n \times m}$, which is
equivalent to $\bar{M}_{ii}\bar{M}_{jj} \ge \bar{M}_{ij}^2$ for all $i,j$. This is satisfied if $\bar{M} \succeq 0$. However, positive
semidefiniteness is not necessary. For example, all three $2 \times 2$ principal minors of the
following matrix are nonnegative but it is not PSD: $\bar{M} = [1, 1, 2;\ 1, 2, 0;\ 2, 0, 5] \not\succeq 0$.
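The counterexample can be confirmed directly. The sketch below, with variable names chosen here for illustration, computes the three $2 \times 2$ principal minors $\bar{M}_{ii}\bar{M}_{jj} - \bar{M}_{ij}^2$ together with the determinant and smallest eigenvalue of the matrix quoted in the text.

```python
import itertools
import numpy as np

M_bar = np.array([[1.0, 1.0, 2.0],
                  [1.0, 2.0, 0.0],
                  [2.0, 0.0, 5.0]])

# 2x2 principal minors M_ii * M_jj - M_ij^2 for i < j.
minors_2x2 = [M_bar[i, i] * M_bar[j, j] - M_bar[i, j] ** 2
              for i, j in itertools.combinations(range(3), 2)]

determinant = float(np.linalg.det(M_bar))
min_eigenvalue = float(np.linalg.eigvalsh(M_bar).min())
```

The minors are 1, 1, and 10, while the determinant is -3, so the pairwise condition holds even though the matrix has a negative eigenvalue.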

3.2. A spectral sufficient condition. As mentioned before, it is generally not 
clear how to provide easy-to-check necessary and sufficient convexity guarantees for 
the case of non-polytope sets M. However, simple sufficient conditions can be easily 
checked for certain classes of sets M, for example spectral sets (Lemma 2). We first 
provide an example and consider a specialized approach to establish convexity, which 
illustrates the advantage of a simple guarantee as the one we present in Lemma 2. 

(i) Consider the VGF defined in (10) and its associated set given in (9) when
we plug in the Frobenius norm; i.e.,

$\mathcal{M} = \{K \circ \bar{M} : \|K\|_F \le 1,\ K^T = K\}$.

In this case, $\mathcal{M}$ is not a polytope, but we can proceed with a similar analysis as in
the previous subsection. In particular, given any $X \in \mathbb{R}^{n \times m}$, the value of $\Omega_{\mathcal{M}}(X)$ is
achieved by an optimal matrix $K_X = (\bar{M} \circ X^T X)/\|\bar{M} \circ X^T X\|_F$. We observe that

$\bar{M} \succeq 0 \ \implies\ \bar{M} \circ \bar{M} \succeq 0 \ \iff\ K_X \circ \bar{M} \succeq 0 \ \ \forall X \ \implies\ \Omega_{\mathcal{M}} \ \text{is convex}$.

The first implication is by the Schur Product Theorem [18, Theorem 7.5.1] and does not
hold in reverse; e.g., $\bar{M} \circ \bar{M} = [1, 1, 2;\ 1, 2, 3;\ 2, 3, 5.01] \succeq 0$ while $\bar{M} \not\succeq 0$. The second
implication, from left to right, is again by the Schur Product Theorem. The right to left
part is by observing that for any $n \ge 1$, $X$ can always be chosen to select a principal
minor of $\bar{M} \circ \bar{M}$. The third implication is straightforward: the pointwise maximum of
convex quadratics is convex. All in all, a sufficient condition for $\Omega_{\mathcal{M}}$ being convex
is that the Hadamard square of $\bar{M}$, namely $\bar{M} \circ \bar{M}$, is PSD. It is worth mentioning
that when $\bar{M} \circ \bar{M} \succeq 0$, hence real, nonnegative and PSD, it is referred to as a doubly
nonnegative matrix.
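The example showing that the first implication does not reverse can be checked numerically. In the sketch below, `M_bar` is taken to be the entrywise square root of the Hadamard square quoted in the text; that reading, and the variable names, are assumptions made here.

```python
import numpy as np

hadamard_square = np.array([[1.0, 1.0, 2.0],
                            [1.0, 2.0, 3.0],
                            [2.0, 3.0, 5.01]])
M_bar = np.sqrt(hadamard_square)   # entrywise square root, so M_bar * M_bar recovers the matrix

min_eig_square = float(np.linalg.eigvalsh(hadamard_square).min())
min_eig_M_bar = float(np.linalg.eigvalsh(M_bar).min())
```

The Hadamard square has a small positive smallest eigenvalue while `M_bar` has a small negative one, so the sufficient condition based on $\bar{M} \circ \bar{M}$ is strictly weaker than $\bar{M} \succeq 0$.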

Denote by $M_+$ the orthogonal projection of a symmetric matrix $M$ onto the PSD
cone, which is given by the matrix formed by only the positive eigenvalues of $M$ and their
associated eigenvectors.

Lemma 2 (a sufficient condition). $\Omega_{\mathcal{M}}$ is convex provided that for any $M \in \mathcal{M}$
there exists $M' \in \mathcal{M}$ such that $M_+ \preceq M'$.

Proof. For any $X$, $\operatorname{tr}(X M X^T) \le \operatorname{tr}(X M_+ X^T)$ clearly holds. Therefore,

$\Omega_{\mathcal{M}}(X) = \max_{M \in \mathcal{M}} \operatorname{tr}(X M X^T) \le \max_{M \in \mathcal{M}} \operatorname{tr}(X M_+ X^T)$.

On the other hand, the assumption of the lemma gives

$\max_{M \in \mathcal{M}} \operatorname{tr}(X M_+ X^T) \le \max_{M' \in \mathcal{M}} \operatorname{tr}(X M' X^T) = \Omega_{\mathcal{M}}(X)$,

which implies that the inequalities have to hold with equality; hence $\Omega_{\mathcal{M}}(X) = \max_{M \in \mathcal{M}} \operatorname{tr}(X M_+ X^T)$
is a pointwise maximum of convex quadratics and is therefore convex. Note that the assumption of the lemma can hold while $M_+$

On the other hand, it is easy to see that the condition in Lemma 2 is not necessary.
Consider $\mathcal{M} = \{M \in \mathbb{S}^2 : |M_{ij}| \le 1\}$. Although the associated VGF is convex
(because the comparison matrix is PSD), there is no matrix $M' \in \mathcal{M}$ satisfying $M' \succeq M_+$,
where $M = [0, 1;\ 1, 1] \in \mathcal{M}$ and $M_+ \approx [0.44, 0.72;\ 0.72, 1.17]$, as for any $M' \in \mathcal{M}$
we have $(M' - M_+)_{22} < 0$.
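The projection $M_+$ used in this counterexample can be computed from an eigendecomposition. The sketch below uses a hypothetical helper `psd_projection` and reproduces the quoted numbers for $M = [0, 1;\ 1, 1]$.

```python
import numpy as np

def psd_projection(M):
    """Orthogonal projection of a symmetric matrix onto the PSD cone."""
    eigvals, eigvecs = np.linalg.eigh(M)
    # Scale each eigenvector column by its eigenvalue clipped at zero.
    return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T

M = np.array([[0.0, 1.0],
              [1.0, 1.0]])
M_plus = psd_projection(M)
```

The (2,2) entry of `M_plus` is about 1.17, which exceeds the bound of 1 on every entry of the box, so no element of the box can dominate `M_plus` in the Loewner order.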

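As discussed before, when $\mathcal{M}$ is a polytope, convexity of $\Omega_{\mathcal{M}} = \Omega_{\mathcal{M}_{\mathrm{eff}}}$ is equivalent
to $\mathcal{M}_{\mathrm{eff}} \subset \mathbb{S}^m_+$. For general sets $\mathcal{M}$, we showed that $\mathcal{M}_+ := \{M_+ : M \in \mathcal{M}\} \subseteq \mathcal{M}$ is a sufficient condition
for convexity. Similar to the proof of Lemma 2, we can provide another sufficient
condition for convexity of a VGF: that all of the maximal points of $\mathcal{M}$ with respect to
the partial order defined by $\mathbb{S}^m_+$ (the Loewner order) are PSD. These are the points
$M \in \mathcal{M}$ for which $(\mathcal{M} - M) \cap \mathbb{S}^m_+ = \{0_m\}$. In all of these pursuits, we are looking for
a subset $\mathcal{M}'$ of the PSD cone such that $\Omega_{\mathcal{M}} = \Omega_{\mathcal{M}'}$. When such a set exists, $\Omega_{\mathcal{M}}$ is convex
and many optimization quantities can be computed for it.

Hereafter, we assume there exists a set $\mathcal{M}' \subseteq \mathcal{M} \cap \mathbb{S}_+$ for which $\Omega_{\mathcal{M}} = \Omega_{\mathcal{M}'}$, which
in turn implies $\Omega_{\mathcal{M}} = \Omega_{\mathcal{M} \cap \mathbb{S}_+}$. For example, based on Theorem 1, this property holds
for all convex VGFs associated to a polytope $\mathcal{M}$.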

3.3. Conjugate function. For any function $\Omega$, the conjugate function is defined
as $\Omega^*(Y) = \sup_X \ \langle X, Y \rangle - \Omega(X)$ and the transformation that maps $\Omega$ to $\Omega^*$ is called
the Legendre-Fenchel transform (e.g., [37, Section 12]).

Lemma 3 (conjugate VGF). Consider a convex VGF associated to a compact
convex set $\mathcal{M}$ with $\Omega_{\mathcal{M}} = \Omega_{\mathcal{M} \cap \mathbb{S}_+}$. The conjugate function is

(19)  $\Omega^*_{\mathcal{M}}(Y) = \tfrac{1}{4} \inf_M \{\operatorname{tr}(Y M^\dagger Y^T) : \operatorname{range}(Y^T) \subseteq \operatorname{range}(M),\ M \in \mathcal{M} \cap \mathbb{S}^m_+\}$,

where $M^\dagger$ is the Moore-Penrose pseudoinverse of $M$.

Note that $\Omega^*(Y)$ is $+\infty$ if the optimization problem in (19) is infeasible; i.e., if
$\operatorname{range}(Y^T) \not\subseteq \operatorname{range}(M)$ for all $M \in \mathcal{M} \cap \mathbb{S}^m_+$; equivalently, if $Y(I - M M^\dagger)$ is nonzero
for all $M \in \mathcal{M} \cap \mathbb{S}^m_+$, where $M M^\dagger$ is the orthogonal projection onto the range of $M$.
This can be seen using generalized Schur complements; e.g., see Appendix A.5.5 in
[9] or [11].
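For the simplest case, a set containing a single positive definite matrix, formula (19) can be compared against the definition of the conjugate. The sketch below is only a sanity check under that assumption: the singleton set `{M0}`, the random data, and the names are illustrative, and the standing assumption $0 \in \mathcal{M}$ is not needed for this computation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 4
B = rng.standard_normal((m, m))
M0 = B @ B.T + np.eye(m)               # positive definite; the set is {M0}
Y = rng.standard_normal((n, m))

def omega(X):
    """VGF of the singleton set {M0}: tr(X M0 X^T)."""
    return float(np.trace(X @ M0 @ X.T))

# Value predicted by (19) for this singleton set.
conjugate_value = 0.25 * float(np.trace(Y @ np.linalg.pinv(M0) @ Y.T))

# The supremum of <X, Y> - omega(X) is attained at X = Y M0^{-1} / 2.
X_star = 0.5 * Y @ np.linalg.inv(M0)
attained = float(np.sum(X_star * Y)) - omega(X_star)

# Fenchel-Young: no other X should exceed the conjugate value.
trial_values = [float(np.sum(X * Y)) - omega(X)
                for X in rng.standard_normal((200, n, m))]
```

Here the maximizer follows from setting the gradient $Y - 2 X M_0$ to zero; for a general set $\mathcal{M}$ the infimum over $M$ in (19) would still have to be computed.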

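Proof. By our assumption that $\Omega_{\mathcal{M}} = \Omega_{\mathcal{M} \cap \mathbb{S}_+}$, we get $\Omega^*_{\mathcal{M}} = \Omega^*_{\mathcal{M} \cap \mathbb{S}_+}$. Define

(20)  $f_{\mathcal{M}}(Y) = \tfrac{1}{4} \inf_{M, C} \Big\{ \operatorname{tr}(C) : \begin{bmatrix} M & Y^T \\ Y & C \end{bmatrix} \succeq 0,\ M \in \mathcal{M} \Big\}$.

The positive semidefiniteness constraint implies $M \succeq 0$, therefore $f_{\mathcal{M}} = f_{\mathcal{M} \cap \mathbb{S}_+}$. Its
conjugate function is

(21)  $f^*_{\mathcal{M}}(X) = \sup_Y \sup_{M, C} \Big\{ \langle X, Y \rangle - \tfrac{1}{4} \operatorname{tr}(C) : \begin{bmatrix} M & Y^T \\ Y & C \end{bmatrix} \succeq 0,\ M \in \mathcal{M} \Big\}$
      $= \sup_{M \in \mathcal{M} \cap \mathbb{S}_+} \sup_{Y, C} \Big\{ \langle X, Y \rangle - \tfrac{1}{4} \operatorname{tr}(C) : \begin{bmatrix} M & Y^T \\ Y & C \end{bmatrix} \succeq 0 \Big\}$.

Consider the dual of the inner optimization problem over $Y$ and $C$. Let $W \succeq 0$ be the
dual variable with corresponding blocks, and write the Lagrangian as

$L(Y, C, W) = \langle X, Y \rangle - \tfrac{1}{4} \operatorname{tr}(C) + \langle W_{11}, M \rangle + 2 \langle W_{21}, Y \rangle + \langle W_{22}, C \rangle$,

whose maximum value is finite only if $W_{21} = -\tfrac{1}{2} X$ and $W_{22} = \tfrac{1}{4} I$. Therefore, the
dual problem is

$\min_{W_{11}} \Big\{ \langle W_{11}, M \rangle : \begin{bmatrix} W_{11} & -\tfrac{1}{2} X^T \\ -\tfrac{1}{2} X & \tfrac{1}{4} I \end{bmatrix} \succeq 0 \Big\}$,

which is equal to $\langle M, X^T X \rangle$. Plugging in (21), we conclude $f^*_{\mathcal{M}} = \Omega_{\mathcal{M} \cap \mathbb{S}_+}$.

Next, convexity and lower semi-continuity of $f_{\mathcal{M}}$ imply $f^{**}_{\mathcal{M}} = f_{\mathcal{M}}$ (e.g., [39,
Theorem 11.1]). Therefore, $f_{\mathcal{M}}$ is equal to $\Omega^*_{\mathcal{M} \cap \mathbb{S}_+}$, which we showed to be equal to
$\Omega^*_{\mathcal{M}}$. Taking the generalized Schur complement of the semidefinite constraint in (20)
gives the desired representation in (19). □

Note that (20) is preferred to a representation with $\mathcal{M} \cap \mathbb{S}_+$ substituted for $\mathcal{M}$. This
is because $\mathcal{M}$ can have a much simpler representation than $\mathcal{M} \cap \mathbb{S}_+$; e.g., as for (3).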

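3.4. Related norms. Given a convex VGF with $\Omega_{\mathcal{M}} = \Omega_{\mathcal{M} \cap \mathbb{S}_+}$, we have

$\Omega_{\mathcal{M}}(X) = \sup_{M \in \mathcal{M} \cap \mathbb{S}_+} \operatorname{tr}(X M X^T) = \sup_{M \in \mathcal{M} \cap \mathbb{S}_+} \|X M^{1/2}\|_F^2$.

This representation shows that $\sqrt{\Omega_{\mathcal{M}}}$ is a semi-norm: absolute homogeneity holds,
and it is easy to prove the triangle inequality for the maximum of semi-norms. The
next lemma, which can be seen from Corollary 15.3.2 of [37], generalizes this assertion.

Lemma 4. Suppose a function $\Omega : \mathbb{R}^{n \times m} \to \mathbb{R}$ is homogeneous of order 2, i.e.,
$\Omega(\theta X) = \theta^2 \Omega(X)$. Then its square root $\|X\| = \sqrt{\Omega(X)}$ is a semi-norm if and only
if $\Omega$ is convex. If $\Omega$ is strictly convex then $\sqrt{\Omega}$ is a norm.

Dual Norm. Considering $\|\cdot\|_{\mathcal{M}} = \sqrt{\Omega_{\mathcal{M}}}$, we have $\tfrac{1}{2}\Omega_{\mathcal{M}} = \tfrac{1}{2}\|\cdot\|_{\mathcal{M}}^2$. Taking the
conjugate function of both sides yields $2\Omega^*_{\mathcal{M}} = \tfrac{1}{2}(\|\cdot\|^*_{\mathcal{M}})^2$, where we used the order-2
homogeneity of $\Omega_{\mathcal{M}}$. Therefore, $\|\cdot\|^*_{\mathcal{M}} = 2\sqrt{\Omega^*_{\mathcal{M}}}$. Given the representation of $\Omega^*_{\mathcal{M}}$ in
Lemma 3, one can derive a similar representation for $\sqrt{\Omega^*_{\mathcal{M}}}$ as follows.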

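Theorem 5. For a convex VGF $\Omega_{\mathcal{M}}$ associated to a nonempty compact convex
set $\mathcal{M}$, with $\Omega_{\mathcal{M}} = \Omega_{\mathcal{M} \cap \mathbb{S}_+}$,

(22)  $\|Y\|^*_{\mathcal{M}} = 2\sqrt{\Omega^*_{\mathcal{M}}(Y)} = \tfrac{1}{2} \inf_{M, C} \Big\{ \operatorname{tr}(C) + \gamma_{\mathcal{M}}(M) : \begin{bmatrix} M & Y^T \\ Y & C \end{bmatrix} \succeq 0 \Big\}$,

where $\gamma_{\mathcal{M}}(M) = \inf\{\lambda \ge 0 : M \in \lambda \mathcal{M}\}$ is the gauge function associated to $\mathcal{M}$.

Proof. The square root function, over positive numbers, can be represented in a
variational form as $\sqrt{y} = \min\{\alpha + \tfrac{y}{4\alpha} : \alpha > 0\}$. Without loss of generality, suppose
$\mathcal{M}$ is a compact convex set containing the origin. Provided that $\Omega^*_{\mathcal{M}}(Y) > 0$, from
the variational representation of a conjugate VGF we have

$2\sqrt{\Omega^*_{\mathcal{M}}(Y)} = \tfrac{1}{2} \inf_{M,\, \alpha > 0} \{\alpha + \tfrac{1}{\alpha} \operatorname{tr}(Y M^\dagger Y^T) : \operatorname{range}(Y^T) \subseteq \operatorname{range}(M),\ M \in \mathcal{M} \cap \mathbb{S}^m_+\}$
$\qquad = \tfrac{1}{2} \inf_{M,\, \alpha > 0} \{\alpha + \operatorname{tr}(Y M^\dagger Y^T) : \operatorname{range}(Y^T) \subseteq \operatorname{range}(M),\ M \in \alpha(\mathcal{M} \cap \mathbb{S}^m_+)\}$,

where we used $(\alpha M)^\dagger = M^\dagger / \alpha$ and performed a change of variable. The last
representation is the same as the one given in the statement of the theorem, as the constraint
restricts $M$ to the PSD cone, for which $\gamma_{\mathcal{M}}(M) = \gamma_{\mathcal{M} \cap \mathbb{S}_+}(M)$. On the other hand,
when $\Omega^*_{\mathcal{M}}(Y) = 0$, the claimed representation returns 0 as well because $0 \in \mathcal{M}$. □

As an example, $\mathcal{M} = \{M \succeq 0 : \operatorname{tr}(M) \le 1\}$ gives $\gamma_{\mathcal{M}}(M) = \operatorname{tr}(M)$ which, if
plugged in (22), yields the well-known semidefinite representation for the nuclear norm.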
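For the nuclear norm example, a feasible pair $(M, C)$ attaining the value in (22) can be written down from the singular value decomposition of $Y$. The sketch below only exhibits such a feasible point (an upper bound on the infimum); it does not solve the semidefinite program, and the random `Y` and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 4
Y = rng.standard_normal((n, m))

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
M = Vt.T @ np.diag(s) @ Vt            # m x m, PSD, tr(M) = sum of singular values
C = U @ np.diag(s) @ U.T              # n x n, PSD, tr(C) = sum of singular values

# The block matrix equals [V; U] diag(s) [V; U]^T and is therefore PSD.
block = np.block([[M, Y.T],
                  [Y, C]])
min_eig_block = float(np.linalg.eigvalsh(block).min())

objective = 0.5 * (float(np.trace(C)) + float(np.trace(M)))   # gauge of M is tr(M) here
nuclear_norm = float(np.linalg.norm(Y, "nuc"))
```

The objective at this point equals the sum of singular values of `Y`, matching the nuclear norm.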
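3.5. Subdifferentials. In this section, we characterize the subdifferential of
VGFs and their conjugate functions, as well as that of their corresponding norms.
Due to the variational definition of a VGF where the objective function is linear in
$M$, and the fact that $\mathcal{M}$ is assumed to be compact, it is straightforward to obtain the
subdifferential of $\Omega_{\mathcal{M}}$ (e.g., see [17, Theorem 4.4.2]).

Proposition 6. For a convex VGF $\Omega_{\mathcal{M}} = \Omega_{\mathcal{M} \cap \mathbb{S}_+}$, the subdifferential at $X$ is
given by

$\partial \Omega_{\mathcal{M}}(X) = \operatorname{conv}\{2 X M : \operatorname{tr}(X M X^T) = \Omega(X),\ M \in \mathcal{M} \cap \mathbb{S}_+\}$.

For the norm $\|X\|_{\mathcal{M}} = \sqrt{\Omega_{\mathcal{M}}(X)}$, we have $\partial \|X\|_{\mathcal{M}} = \frac{1}{2\|X\|_{\mathcal{M}}} \partial \Omega_{\mathcal{M}}(X)$ if $\Omega_{\mathcal{M}}(X) \ne 0$.

As an example, the subdifferential of $\Omega(X) = \sum_{i,j=1}^m \bar{M}_{ij} |x_i^T x_j|$, from (4), is given by

(23)  $\partial \Omega(X) = \{2 X M : M_{ij} = \bar{M}_{ij} \operatorname{sign}(x_i^T x_j) \ \text{if} \ \langle x_i, x_j \rangle \ne 0,\ M_{ii} = \bar{M}_{ii},\ |M_{ij}| \le \bar{M}_{ij} \ \text{otherwise}\}$.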
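A single element of (23) is cheap to compute: pick the sign pattern of the Gram matrix. The sketch below checks the subgradient inequality numerically at random points; the matrix `M_bar` (whose comparison matrix is positive definite, so the VGF is convex), the helper names, and the random data are illustrative assumptions.

```python
import numpy as np

M_bar = np.array([[2.0, 1.0, 0.0],
                  [1.0, 2.0, 1.0],
                  [0.0, 1.0, 2.0]])   # comparison matrix is PSD, so the VGF (4) is convex

def omega(X):
    """VGF (4): sum_ij M_bar_ij * |x_i^T x_j|."""
    return float(np.sum(M_bar * np.abs(X.T @ X)))

def subgradient(X):
    """One element 2 X M of (23), with M_ij = M_bar_ij * sign(x_i^T x_j)."""
    M = M_bar * np.sign(X.T @ X)
    return 2.0 * X @ M

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 3))
G = subgradient(X)

# Gap in the subgradient inequality omega(Z) >= omega(X) + <G, Z - X> at random Z.
gaps = [omega(Z) - omega(X) - float(np.sum(G * (Z - X)))
        for Z in rng.standard_normal((200, 3, 3))]
```

When an inner product is exactly zero, `np.sign` returns 0, which is one admissible choice within the interval allowed by (23).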

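Proposition 7. For a convex VGF $\Omega_{\mathcal{M}} = \Omega_{\mathcal{M} \cap \mathbb{S}_+}$, the subdifferential of its
conjugate function is given by

(24)  $\partial \Omega^*_{\mathcal{M}}(Y) = \{\tfrac{1}{2}(Y M^\dagger + W) : \Omega(Y M^\dagger + W) = 4\Omega^*(Y) = \operatorname{tr}(Y M^\dagger Y^T),$
      $\operatorname{range}(W^T) \subseteq \ker(M) \subseteq \ker(Y),\ M \in \mathcal{M} \cap \mathbb{S}_+\}$.

When $\Omega^*_{\mathcal{M}}(Y) \ne 0$ we have $\partial \|Y\|^*_{\mathcal{M}} = \frac{2}{\|Y\|^*_{\mathcal{M}}} \partial \Omega^*_{\mathcal{M}}(Y)$.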

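Proof. We use the results on subdifferentiation in parametric minimization [39,
Section 10.G]. First, let us fix some notation. Throughout the proof, we denote $\tfrac{1}{2}\Omega$ by
$\Omega$, and $2\Omega^*$ by $\Omega^*$. Denote by $\iota_{\mathcal{M}}(M)$ the indicator function of the set $\mathcal{M}$, which is
$0$ when $M \in \mathcal{M}$ and $+\infty$ otherwise. We use $\mathcal{M}$ instead of $\mathcal{M} \cap \mathbb{S}_+$ to simplify the
notation. Considering

$f(Y, M) := \tfrac{1}{2} \operatorname{tr}(Y M^\dagger Y^T)$ if $\operatorname{range}(Y^T) \subseteq \operatorname{range}(M)$, and $f(Y, M) := +\infty$ otherwise,

we have $\Omega^*(Y) = \inf_M \ f(Y, M) + \iota_{\mathcal{M}}(M)$. For such a function, we can use results in
[10, Theorem 4.8] to show that

$\partial f(Y, M) = \operatorname{conv}\{(Z, -\tfrac{1}{2} Z^T Z) : Z = Y M^\dagger + W,\ \operatorname{range}(W^T) \subseteq \ker(M)\}$.

Since $g(Y, M) := f(Y, M) + \iota_{\mathcal{M}}(M)$ is convex, we can use the second part of Theorem
10.13 in [39]: for any choice of $M_0$ which is optimal in the definition of $\Omega^*(Y)$,

$\partial \Omega^*(Y) = \{Z : (Z, 0) \in \partial g(Y, M_0)\}$.

Therefore, for any $Z \in \partial \Omega^*(Y)$ we have $\tfrac{1}{2} Z^T Z \in \partial \iota_{\mathcal{M}}(M_0) = \{G : \langle G, M' - M_0 \rangle \le
0,\ \forall M' \in \mathcal{M}\}$. (Here $\partial \iota_{\mathcal{M}}(M_0)$ is the normal cone of $\mathcal{M}$ at $M_0$.) This implies
$\tfrac{1}{2} \operatorname{tr}(Z M' Z^T) \le \tfrac{1}{2} \operatorname{tr}(Z M_0 Z^T)$ for all $M' \in \mathcal{M}$. Taking the supremum of the left
hand side over all $M' \in \mathcal{M}$, we get

$\Omega(Z) = \tfrac{1}{2} \operatorname{tr}(Z M_0 Z^T) = \tfrac{1}{2} \operatorname{tr}(Y M_0^\dagger Y^T) = \Omega^*(Y)$,

where the second equality follows from $\operatorname{range}(W^T) \subseteq \ker(M_0)$ (which is equivalent to
$M_0 W^T = 0$). Alternatively, for any matrix $Z$ from the right hand side of (24) (after
adjustment to our rescaling of the definition of $\Omega$ by $\tfrac{1}{2}$), and any $Y' \in \mathbb{R}^{n \times m}$, we have

$\Omega^*(Y') \ge \langle Y', Z \rangle - \Omega(Z) = \langle Y', Z \rangle - \Omega^*(Y) = \langle Y' - Y, Z \rangle + \Omega^*(Y)$,

where we used Fenchel's inequality, as well as the characterization of $Z$. Therefore,
$Z \in \partial \Omega^*(Y)$. This finishes the proof. Note that for an achieving $M$, $\ker(M) \subseteq \ker(Y)$
(i.e., $\operatorname{range}(Y^T) \subseteq \operatorname{range}(M)$) has to hold for the conjugate function to be defined. □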

Since $\partial\,\Omega^*(Y)$ is non-empty, for any choice of $M_0$, there exists a $W$ such that
$\frac{1}{2}(Y M_0^\dagger + W) \in \partial\,\Omega^*(Y)$. However, finding such a $W$ is not trivial. The following
lemma characterizes the subdifferential as the solution set of a convex optimization
problem involving $\Omega$ and affine constraints.

Lemma 8. Given $Y$ and an optimal $M_0$,

$$\partial\,\Omega^*(Y) = \operatorname*{Arg\,min}_Z \ \{\,\Omega(Z) :\ Z = \tfrac{1}{2}(Y M_0^\dagger + W),\ W M_0 M_0^\dagger = 0\,\},$$

where $W M_0 M_0^\dagger = 0$ is equivalent to $\operatorname{range}(W^T) \subseteq \ker(M_0)$.

This is because for all feasible $Z$ we have $\Omega(Z) \ge \operatorname{tr}(Z M_0 Z^T) = \Omega^*(Y)$. Moreover,
notice that the optimality of $M_0$ implies $\ker(M_0) \subseteq \ker(Y)$.

The characterization of the whole subdifferential is helpful for understanding
optimality conditions, but algorithms only need to compute a single subgradient, which
is easier than computing the whole subdifferential.
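
As an illustration of computing a single subgradient, the following is a minimal sketch under the assumption (made here only for illustration) that $\mathcal{M}$ is the convex hull of finitely many PSD matrices, so the maximum of $\operatorname{tr}(XMX^T)$ is attained at one of them. By Proposition 6, $2XM$ is then a subgradient for any maximizing $M$. The function name `vgf_value_and_subgradient` is hypothetical.

```python
import numpy as np

def vgf_value_and_subgradient(X, Ms):
    """Return Omega(X) = max_k tr(X M_k X^T) and one subgradient 2 X M_k.

    Assumes every M_k is symmetric PSD, so that Omega is convex.
    """
    G = X.T @ X                              # Gram matrix (m x m)
    # For symmetric M, tr(X M X^T) equals the entrywise inner product <M, X^T X>.
    values = [float(np.sum(M * G)) for M in Ms]
    k = int(np.argmax(values))               # index of a maximizing matrix
    return values[k], 2.0 * X @ Ms[k]
```

Any maximizer gives a valid subgradient; when several matrices attain the maximum, their convex combinations generate the rest of the subdifferential.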

3.6. Composition of VGF and absolute values. The characterization of
the subdifferential allows us to establish conditions for convexity of $\Psi(X) = \Omega(|X|)$
defined in (14). Our result is based on the following lemma.

Lemma 9. Given a function $f : \mathbb{R}^n \to \mathbb{R}$, consider $g(x) = \min_{y \ge |x|} f(y)$ and
$h(x) = f(|x|)$, where the absolute values and inequalities are all entry-wise. Then,

(a) $h^{**} \le g \le h$.

(b) If $f$ is convex then $g$ is convex and $g = h^{**}$.

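Proof. (a) In $h^*(y) = \sup_x \{\langle x, y\rangle - f(|x|)\}$, the optimal $x$ has the same sign
pattern as $y$; hence $h^*(y) = \sup_{x \ge 0} \{\langle x, |y|\rangle - f(x)\}$. Next, we have

$$\begin{aligned}
h^{**}(z) &= \sup_y \Big\{\langle y, z\rangle - \sup_{x \ge 0} \{\langle x, |y|\rangle - f(x)\}\Big\} = \sup_{y \ge 0}\, \inf_{x \ge 0} \{\langle y, |z|\rangle - \langle x, y\rangle + f(x)\} \\
&\le \inf_{x \ge 0}\, \sup_{y \ge 0} \{\langle y, |z|\rangle - \langle x, y\rangle + f(x)\} = \inf_{x \ge 0}\, \sup_{y \ge 0} \{\langle y, |z| - x\rangle + f(x)\} \\
&= \inf_{x \ge |z|} f(x) = g(z).
\end{aligned}$$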

This shows the first inequality in part (a). The second inequality follows directly from 
the definition of g and h. 

(b) Consider $x_1, x_2 \in \mathbb{R}^n$ and $\theta \in [0,1]$. Suppose $g(x_i) = f(y_i)$ for some $y_i \ge |x_i|$,
for $i = 1, 2$. In other words, $y_i$ is the minimizer in the definition of $g(x_i)$. Then,

$$\theta y_1 + (1-\theta) y_2 \ \ge\ \theta |x_1| + (1-\theta)|x_2| \ \ge\ |\theta x_1 + (1-\theta) x_2|.$$

By definition of $g$ and convexity of $f$,

$$g(\theta x_1 + (1-\theta)x_2) \le f(\theta y_1 + (1-\theta) y_2) \le \theta f(y_1) + (1-\theta) f(y_2) = \theta g(x_1) + (1-\theta) g(x_2),$$

which implies that $g$ is convex. It is a classical result that the epigraph of the
biconjugate $h^{**}$ is the closed convex hull of the epigraph of $h$; in other words, $h^{**}$ is
the largest lower semi-continuous convex function that is no larger than $h$ (e.g., [37,
Theorem 12.2]). Since $g$ is convex and $h^{**} \le g \le h$, we must have $h^{**} = g$. □
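
To make Lemma 9 concrete, here is a small scalar sketch with the illustrative choice $f(y) = (y-1)^2$ (this choice is an assumption of the example, not taken from the text). Then $h(x) = (|x|-1)^2$ is not convex, while $g(x) = \min_{y \ge |x|} f(y)$ has the closed form $\max(|x|-1, 0)^2$, which is convex and lies below $h$.

```python
import numpy as np

def f(y):
    """Illustrative convex function f(y) = (y - 1)^2."""
    return (y - 1.0) ** 2

def h(x):
    """h(x) = f(|x|); nonconvex for this choice of f."""
    return f(np.abs(x))

def g(x):
    """g(x) = min over y >= |x| of f(y); the minimizer is y = max(|x|, 1)."""
    return f(np.maximum(np.abs(x), 1.0))
```

Here $h(0) = 1$ exceeds the average of $h(-1) = h(1) = 0$, so $h$ violates convexity, whereas $g$ vanishes on $[-1, 1]$ and coincides with $h$ outside that interval.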

Corollary 10. Let $\Omega_{\mathcal{M}}$ be a convex VGF. Then, $\Omega_{\mathcal{M}}(|X|)$ is a convex function
of $X$ if and only if $\Omega_{\mathcal{M}}(|X|) = \min_{Y \ge |X|} \Omega_{\mathcal{M}}(Y)$.

Proof. Let $\Omega_{\mathcal{M}}$ be the function $f$ in Lemma 9. Then we have $h(X) = \Omega_{\mathcal{M}}(|X|)$
and $g(X) = \min_{Y \ge |X|} \Omega_{\mathcal{M}}(Y)$. If $h$ is convex then, being also closed, it satisfies
$h = h^{**}$ [37, Theorem 12.2], thus part (a) of Lemma 9 implies $h = g$. On the other
hand, given a convex function $f$, part (b) of Lemma 9 states that $g = h^{**}$ is also
convex. Hence, $h = g$ implies convexity of $h$. □

Another proof of Corollary 10, in the case where $\sqrt{\Omega_{\mathcal{M}}}$ is a norm and not a
semi-norm, is given as Lemma 15 in the Appendix.

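Lemma 11. Let $\Omega_{\mathcal{M}}$ be a convex VGF with $\Omega_{\mathcal{M}} \equiv \Omega_{\mathcal{M}\cap\mathbb{S}_+^m}$. If $\partial\,\Omega_{\mathcal{M}}(X) \cap \mathbb{R}_+^{n\times m} \neq \emptyset$
holds for any $X \ge 0$, then $\Psi(X) = \Omega_{\mathcal{M}}(|X|)$ is convex.

Proof. Using the definition of subgradients for $\Omega$ at $|X|$ we have, for any $\Delta$,

$$\Omega(|X| + \Delta) \ \ge\ \Omega(|X|) + \sup\{\langle G, \Delta\rangle :\ G \in \partial\,\Omega(|X|)\},$$

where the right-most term is the directional derivative of $\Omega$ at $|X|$ in the direction
$\Delta$. From the assumption, taking $\Delta = Y - |X| \ge 0$ and a nonnegative subgradient $G$,
we get $\Omega(Y) \ge \Omega(|X|)$ for all $Y \ge |X|$. Therefore,
$\Psi(X) = \Omega_{\mathcal{M}}(|X|) = \min_{Y \ge |X|} \Omega_{\mathcal{M}}(Y)$. Corollary 10 establishes the convexity of $\Psi$. □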

For example, consider the VGF $\Omega_{\mathcal{M}}$ defined in (4), and assume that it is convex.
Its subdifferential $\partial\,\Omega_{\mathcal{M}}$ is given in (23). For each $X \ge 0$, the matrix product $X\bar{M} \ge 0$
since $\bar{M}$ is also a nonnegative matrix, and $2X\bar{M}$ belongs to $\partial\,\Omega_{\mathcal{M}}(X)$. Therefore the
condition in the above lemma is satisfied, and the function $\Psi(X) = \Omega_{\mathcal{M}}(|X|)$ is
convex and has an alternative representation $\Psi(X) = \min_{Y \ge |X|} \Omega_{\mathcal{M}}(Y)$. This specific
function $\Psi$ has been used in [42] for learning matrices with disjoint supports.

4. Proximal operators. The proximal operator of a closed convex function $h(\cdot)$
is defined as $\operatorname{prox}_h(x) = \arg\min_u \{h(u) + \frac{1}{2}\|u - x\|_2^2\}$, which always exists and is
unique (e.g., [37, Section 31]). Computing the proximal operator is the essential step
in the proximal point algorithm ([30, 38]) and the proximal gradient methods (e.g.,
[36]). In each iteration of such algorithms, we need to compute $\operatorname{prox}_{\tau h}(\cdot)$ where $\tau > 0$
is a step size parameter. To simplify the presentation, assume $\mathcal{M} \subseteq \mathbb{S}_+^m$ and consider
the associated VGF. Then,

$$\operatorname{prox}_{\tau\Omega}(X) = \arg\min_Y \ \max_{M \in \mathcal{M}} \ \tfrac{1}{2}\|Y - X\|_F^2 + \tau \operatorname{tr}(Y M Y^T). \tag{25}$$

Since $\mathcal{M} \subseteq \mathbb{S}_+^m$ is a compact convex set, one can change the order of min and max and
first solve for $Y$ in terms of any given $X$ and $M$, which gives $Y = X(I + 2\tau M)^{-1}$.
Then we can find the optimal $M_0 \in \mathcal{M}$ given $X$ as

$$M_0 = \arg\min_{M \in \mathcal{M}} \ \operatorname{tr}\big(X(I + 2\tau M)^{-1} X^T\big)$$

which gives $\operatorname{prox}_{\tau\Omega}(X) = X(I + 2\tau M_0)^{-1}$. To compute the proximal operator for the
conjugate function $\Omega^*$, one can use Moreau's formula (see, e.g., [37, Theorem 31.5]):

$$\operatorname{prox}_{\tau\Omega}(X) + \tau\,\operatorname{prox}_{\tau^{-1}\Omega^*}(\tau^{-1}X) = X. \tag{26}$$

Next we discuss proximal operators of norms induced by VGFs (section 3.4).
Since computing the proximal operator of a norm is related to projection onto the
dual norm ball, i.e., $\operatorname{prox}_{\tau\|\cdot\|}(X) = X - \Pi_{\|\cdot\|^* \le \tau}(X)$, we can express the proximal
operator of the norm $\|\cdot\| \equiv \sqrt{\Omega_{\mathcal{M}}(\cdot)}$ as

$$\operatorname{prox}_{\tau\|\cdot\|}(X) = X - \arg\min_Y \min_{M, C} \Big\{ \|Y - X\|_F^2 :\ \operatorname{tr}(C) \le \tau^2,\ \begin{bmatrix} M & Y^T \\ Y & C \end{bmatrix} \succeq 0,\ M \in \mathcal{M} \Big\}$$

using (20), (22). Moreover, plugging (22) in the definition of proximal operator gives

$$\operatorname{prox}_{\tau\|\cdot\|^*}(X) = \arg\min_Y \min_{M, C} \Big\{ \|Y - X\|_F^2 + \tau\big(\operatorname{tr}(C) + \gamma_{\mathcal{M}}(M)\big) :\ \begin{bmatrix} M & Y^T \\ Y & C \end{bmatrix} \succeq 0 \Big\}$$

where $\gamma_{\mathcal{M}}(M) = \inf\{\lambda > 0 : M \in \lambda\mathcal{M}\}$ is the gauge function associated to the
nonempty convex set $\mathcal{M}$. The computational cost for computing proximal operators
can be high in general (involving solving semidefinite programs); however, they may
be simplified for special cases of $\mathcal{M}$. For example, a fast algorithm for computing the
proximal operator of the VGF associated with the set $\mathcal{M}$ defined in (13) is presented in
[31]. For general problems, due to the convex-concave saddle point structure in (25),
we may use the mirror-prox algorithm [35] to obtain an inexact solution.
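
The closed form $Y = X(I + 2\tau M)^{-1}$ behind (25) and Moreau's formula (26) can be checked numerically in the simplest case. The sketch below assumes $\mathcal{M} = \{M\}$ is a singleton with $M$ positive definite, so that $\Omega(Y) = \operatorname{tr}(YMY^T)$ and $\Omega^*(Y) = \frac{1}{4}\operatorname{tr}(YM^{-1}Y^T)$; the helper name `prox_single_matrix_vgf` is hypothetical.

```python
import numpy as np

def prox_single_matrix_vgf(X, M, tau):
    """prox of tau * tr(Y M Y^T) at X, for one symmetric PSD matrix M.

    Setting the gradient Y - X + 2*tau*Y*M to zero gives Y = X (I + 2 tau M)^{-1}.
    Since I + 2 tau M is symmetric, we solve (I + 2 tau M) Y^T = X^T instead of
    forming an explicit inverse.
    """
    m = M.shape[0]
    return np.linalg.solve(np.eye(m) + 2.0 * tau * M, X.T).T
```

For a singleton, the conjugate is again a quadratic of the same form with $M$ replaced by $\frac{1}{4}M^{-1}$, so the same helper evaluates $\operatorname{prox}_{\tau^{-1}\Omega^*}$ and the two proximal points should add up to $X$.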

Left unitary invariance and QR factorization. As mentioned before, VGFs and
their conjugates are left unitarily invariant. We can use this fact to simplify the
computation of the corresponding proximal operators when $n > m$. Consider the QR
decomposition of a matrix $Y = QR$ where $Q$ is an orthogonal matrix with $Q^TQ =
QQ^T = I$ and $R = [R_Y^T \ 0]^T$ is an upper triangular matrix with $R_Y \in \mathbb{R}^{m\times m}$. From the
definition, we have $\Omega(Y) = \Omega(R_Y)$ and $\Omega^*(Y) = \Omega^*(R_Y)$. For the proximal operators,
we can simply plug in $R_X$ from the QR decomposition $X = Q[R_X^T \ 0]^T$ to get


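$$\begin{aligned}
\operatorname{prox}_{\tau\Omega^*}(X) &= \arg\min_Y \min_{M, C} \Big\{ \tfrac{1}{2}\|Y - X\|_F^2 + \tfrac{1}{4}\tau \operatorname{tr}(C) :\ \begin{bmatrix} M & Y^T \\ Y & C \end{bmatrix} \succeq 0,\ M \in \mathcal{M} \Big\} \\
&= Q \cdot \arg\min_R \min_{M, C} \Big\{ \tfrac{1}{2}\|R - R_X\|_F^2 + \tfrac{1}{4}\tau \operatorname{tr}(C) :\ \begin{bmatrix} M & R^T \\ R & C \end{bmatrix} \succeq 0,\ M \in \mathcal{M} \Big\}
\end{aligned}$$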


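where $R$ is constrained to the set of upper triangular matrices and the new PSD
matrix is of size $2m$ instead of the $n + m$ that we had before. The above equality uses
two facts. First,

$$\begin{bmatrix} I_m & 0 \\ 0 & Q \end{bmatrix}^T \begin{bmatrix} M & Y^T \\ Y & C \end{bmatrix} \begin{bmatrix} I_m & 0 \\ 0 & Q \end{bmatrix} = \begin{bmatrix} M & [R^T \ 0] \\ \begin{bmatrix} R \\ 0 \end{bmatrix} & Q^T C Q \end{bmatrix}$$

where the right and left matrices in the multiplication are orthogonal, so the
transformation preserves positive semidefiniteness. Secondly,
$\operatorname{tr}(C) = \operatorname{tr}(C')$ where $C' = Q^T C Q$, and assuming $C'$ to be zero outside the first $m \times m$
block can only reduce the objective function. Therefore, we can ignore the last $n - m$
rows and columns of the above PSD matrix.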

More generally, because of left unitary invariance, the optimal $Y$'s in all of the
optimization problems in this section have the same column space as the input matrix
$X$; otherwise, a rotation as in (27) produces a feasible $Y$ with a smaller value for the
objective function.
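
The invariance $\Omega(Y) = \Omega(R_Y)$ is easy to verify numerically, since $Y^TY = R_Y^TR_Y$ for any QR factorization. The following sketch assumes, for illustration only, a VGF given by a maximum over a finite list of symmetric matrices; `vgf` and `qr_reduce` are hypothetical helper names.

```python
import numpy as np

def vgf(X, Ms):
    """Omega(X) = max over the listed matrices M of tr(X M X^T)."""
    return max(float(np.trace(X @ M @ X.T)) for M in Ms)

def qr_reduce(Y):
    """Return the m x m triangular factor R_Y of the thin QR factorization Y = Q R_Y."""
    _, R = np.linalg.qr(Y, mode="reduced")
    return R
```

When $n \gg m$, evaluating the VGF (or solving the associated semidefinite program) on the $m \times m$ factor instead of the $n \times m$ matrix is where the savings come from.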

5. Algorithms for optimization with VGF. In this section, we discuss
optimization algorithms for solving convex minimization problems, in the form of (6),
with VGF penalties. The proximal operators of VGFs we studied in the previous
section are the key parts of proximal gradient methods (see, e.g., [5, 6, 36]). More
specifically, when the loss function $\mathcal{L}(X)$ is smooth, we can iteratively update the
variables as follows:

$$X^{(t+1)} = \operatorname{prox}_{\gamma_t \lambda\Omega}\big(X^{(t)} - \gamma_t \nabla\mathcal{L}(X^{(t)})\big), \qquad t = 0, 1, 2, \ldots,$$

where $\gamma_t$ is a step size at iteration $t$. When $\mathcal{L}(X)$ is not smooth, then we can use
subgradients of $\mathcal{L}$ in the above algorithm, or use the classical subgradient method
on the overall objective $\mathcal{L}(X) + \lambda\Omega(X)$. In either case, we need to use diminishing
step sizes and the convergence can be very slow. Even when the convergence is
relatively fast (in terms of number of iterations), the computational cost of the proximal
operator in each iteration can be very high.

In this section, we focus on loss functions that have a special form shown in (28).
This form comes up in many common loss functions, some of which are listed later in this
section, and allows for faster algorithms. We assume that the loss function $\mathcal{L}$ in (6)
has the following representation:

$$\mathcal{L}(X) = \max_{g \in \mathcal{G}} \ \langle X, \mathcal{D}(g)\rangle - \hat{\mathcal{L}}(g), \tag{28}$$

where $\hat{\mathcal{L}} : \mathbb{R}^p \to \mathbb{R}$ is a convex function, $\mathcal{G}$ is a convex and compact subset of $\mathbb{R}^p$,
and $\mathcal{D} : \mathbb{R}^p \to \mathbb{R}^{n\times m}$ is a linear operator. This is also known as a Fenchel-type
representation (see, e.g., [24]). Moreover, consider the infimal post-composition [4,
Definition 12.33] of $\hat{\mathcal{L}} : \mathcal{G} \to \mathbb{R}$ by $\mathcal{D}(\cdot)$, defined as

$$(\mathcal{D} \triangleright \hat{\mathcal{L}})(Y) = \inf\{\hat{\mathcal{L}}(g) :\ \mathcal{D}(g) = Y,\ g \in \mathcal{G}\}.$$

Then, the conjugate to this function is equal to $\mathcal{L}$. In other words, $\mathcal{L}(X) = \hat{\mathcal{L}}^*(\mathcal{D}^*(X))$
where $\hat{\mathcal{L}}^*$ is the conjugate function and $\mathcal{D}^*$ is the adjoint operator. The composition of
a nonlinear convex loss function and a linear operator is very common for optimization
of linear predictors in machine learning (e.g., [16]), which we will demonstrate with
several examples later in this section.

With the variational representation of $\mathcal{L}$ in (28), and assuming $\Omega_{\mathcal{M}} \equiv \Omega_{\mathcal{M}\cap\mathbb{S}_+^m}$,
we can write the VGF-penalized loss minimization problem (6) as a convex-concave
saddle-point optimization problem:

$$J_{\mathrm{opt}} = \min_X \ \max_{M \in \mathcal{M}\cap\mathbb{S}_+^m,\ g \in \mathcal{G}} \ \langle X, \mathcal{D}(g)\rangle - \hat{\mathcal{L}}(g) + \lambda \operatorname{tr}(X M X^T). \tag{29}$$

If $\hat{\mathcal{L}}$ is smooth (while $\mathcal{L}$ may be nonsmooth) and the sets $\mathcal{G}$ and $\mathcal{M}$ are simple (e.g.,
admitting simple projections), we can solve problem (29) using the mirror-prox
algorithm [35, 24]. In Section 5.1, we present a variant of the mirror-prox algorithm
equipped with an adaptive line search scheme. Then in Section 5.2, we present a
preprocessing technique to transform problems of the form (29) into smaller dimensions,
which can be solved more efficiently under favorable conditions.

Before diving into the algorithmic details, we examine some common loss
functions and derive the corresponding representation (28) for them. This discussion will
provide intuition for the linear operator $\mathcal{D}$ and the set $\mathcal{G}$ in relation with data and
prediction.

Norm loss. Given a norm $\|\cdot\|$ and its dual $\|\cdot\|^*$, consider the squared norm loss

$$\mathcal{L}(x) = \tfrac{1}{2}\|Ax - b\|^2 = \max_g \ \langle g, Ax - b\rangle - \tfrac{1}{2}(\|g\|^*)^2.$$

In terms of the representation in (28), here we have $\mathcal{D}(g) = A^T g$ and $\hat{\mathcal{L}}(g) =
\frac{1}{2}(\|g\|^*)^2 + b^T g$. Similarly, a norm loss can be represented as

$$\mathcal{L}(x) = \|Ax - b\| = \max_g \ \{\langle x, A^T g\rangle - b^T g :\ \|g\|^* \le 1\},$$

where we have $\mathcal{D}(g) = A^T g$, $\hat{\mathcal{L}}(g) = b^T g$ and $\mathcal{G} = \{g : \|g\|^* \le 1\}$.
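
For the Euclidean norm (which is its own dual), the maximizer in the norm-loss representation is simply the normalized residual. The sketch below, with the hypothetical helper `norm_loss_variational`, evaluates $\langle x, A^Tg\rangle - b^Tg$ at that $g$ so it can be compared with $\|Ax - b\|_2$; the restriction to the $\ell_2$ norm is an assumption of this example.

```python
import numpy as np

def norm_loss_variational(A, b, x):
    """Evaluate max_{||g||_2 <= 1} <x, A^T g> - b^T g and return (value, maximizer g)."""
    r = A @ x - b                            # residual
    nrm = np.linalg.norm(r)
    # The maximizer aligns g with the residual; any feasible g works if r = 0.
    g = r / nrm if nrm > 0 else np.zeros_like(r)
    value = x @ (A.T @ g) - b @ g            # equals <g, Ax - b>
    return float(value), g
```

Any other feasible $g$ gives a smaller or equal value by the Cauchy-Schwarz inequality.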

$\epsilon$-insensitive (deadzone) loss. Another variant of the absolute loss function is
called the $\epsilon$-insensitive loss (e.g., see [34, Section 14.5A] for more details and
applications) and can be represented, similar to (28), as

$$\mathcal{L}_\epsilon(x) = (|x| - \epsilon)_+ = \max_{\alpha, \beta} \ \{\alpha(x - \epsilon) + \beta(-x - \epsilon) :\ \alpha, \beta \ge 0,\ \alpha + \beta \le 1\}.$$

Hinge loss for binary classification. In binary classification problems, we are given
a set of training examples $(a_1, b_1), \ldots, (a_N, b_N)$, where each $a_s \in \mathbb{R}^n$ is a feature vector
and $b_s \in \{+1, -1\}$ is a binary label. We would like to find $x \in \mathbb{R}^n$ such that the linear
function $a_s^T x$ can predict the sign of label $b_s$ for each $s = 1, \ldots, N$. The hinge loss
$\max\{0, 1 - b_s(a_s^T x)\}$ returns 0 if $b_s(a_s^T x) \ge 1$ and a positive loss growing with the
absolute value of $b_s(a_s^T x)$ when it is negative. The average hinge loss over the whole
data set can be expressed as

$$\mathcal{L}(x) = \frac{1}{N}\sum_{s=1}^N \max\{0, 1 - b_s(a_s^T x)\} = \max_{g \in \mathcal{G}} \ \langle g, \mathbf{1} - Dx\rangle,$$

where $D = [b_1 a_1, \ldots, b_N a_N]^T$. Here, in terms of (28), we have $\mathcal{G} = \{g \in \mathbb{R}^N : 0 \le
g_s \le 1/N\}$, $\mathcal{D}(g) = -D^T g$, and $\hat{\mathcal{L}}(g) = -\mathbf{1}^T g$.
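
The equality between the averaged hinge loss and its variational form can be checked directly: the maximizing $g$ puts weight $1/N$ on every example with positive slack $1 - b_s a_s^Tx$ and zero elsewhere. This is a minimal sketch; the function names are hypothetical, and the feature vectors are assumed to be stored as the rows of `A`.

```python
import numpy as np

def hinge_primal(A, b, x):
    """Average hinge loss (1/N) sum_s max(0, 1 - b_s * a_s^T x); rows of A are a_s^T."""
    return float(np.mean(np.maximum(0.0, 1.0 - b * (A @ x))))

def hinge_variational(A, b, x):
    """Evaluate max_{0 <= g_s <= 1/N} <g, 1 - D x> with D = diag(b) A; return (value, g)."""
    N = len(b)
    D = b[:, None] * A                       # row s is b_s * a_s^T
    slack = 1.0 - D @ x
    g = np.where(slack > 0, 1.0 / N, 0.0)    # coordinate-wise maximizer over the box
    return float(g @ slack), g
```

Because the objective is linear in $g$ and the feasible set is a box, the maximization separates across coordinates.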

Multi-class hinge loss. For multiclass classification problems, each sample $a_s$ has
a label $b_s \in \{1, \ldots, m\}$, for $s = 1, \ldots, N$. Our goal is to learn a set of classifiers
$x_1, \ldots, x_m$ that can predict the labels $b_s$ correctly. For any given example $a_s$ with
label $b_s$, we say the prediction made by $x_1, \ldots, x_m$ is correct if

$$x_i^T a_s \ge x_j^T a_s \quad \text{for all } (i, j) \in \mathcal{I}(b_s), \tag{30}$$

where $\mathcal{I}(k)$, for $k = 1, \ldots, m$, characterizes the required comparisons to be made for
any example with label $k$. Here are two examples.

1. Flat multiclass classification: $\mathcal{I}(k) = \{(k, j) : j \neq k\}$. In this case, the
constraints in (30) are equivalent to the label $b_s = \arg\max_{i \in \{1, \ldots, m\}} x_i^T a_s$; see [43].

2. Hierarchical classification. In this case, the labels $\{1, \ldots, m\}$ are organized
in a tree structure, and each $\mathcal{I}(k)$ is a special subset of the edges in the tree depending
on the class label $k$; see Section 6 and [14, 44] for further details.

Given the labeled data set $(a_1, b_1), \ldots, (a_N, b_N)$, we can optimize $X = [x_1, \ldots, x_m]$
to minimize the averaged multi-class hinge loss

$$\mathcal{L}(X) = \frac{1}{N}\sum_{s=1}^N \max\Big\{0,\ 1 - \min_{(i,j) \in \mathcal{I}(b_s)} \{x_i^T a_s - x_j^T a_s\}\Big\}, \tag{31}$$

which penalizes the amount of violation for the inequality constraints in (30).

In order to represent the loss function in (31) in the form of (28), we need some
more notation. Let $p_k = |\mathcal{I}(k)|$, and define $E_k \in \mathbb{R}^{m\times p_k}$ as the incidence matrix
for the pairs in $\mathcal{I}(k)$; i.e., each column of $E_k$, corresponding to a pair $(i, j) \in \mathcal{I}(k)$, has
only two nonzero entries: $-1$ at the $i$th entry and $+1$ at the $j$th entry. Then the $p_k$
constraints in (30) can be summarized as $E_k^T X^T a_s \le 0$. It can be shown that the
multi-class hinge loss $\mathcal{L}(X)$ in (31) can be represented in the form (28) via

$$\mathcal{D}(g) = -A\,\mathcal{E}(g), \quad \text{and} \quad \hat{\mathcal{L}}(g) = -\mathbf{1}^T g,$$

where $A = [a_1 \ \cdots \ a_N]$ and $\mathcal{E}(g) = [E_{b_1} g_1 \ \cdots \ E_{b_N} g_N]^T \in \mathbb{R}^{N\times m}$. Moreover, the
domain of maximization in (28) is defined as

$$\mathcal{G} = \mathcal{G}_{b_1} \times \cdots \times \mathcal{G}_{b_N} \quad \text{where} \quad \mathcal{G}_k = \{g \in \mathbb{R}^{p_k} :\ g \ge 0,\ \mathbf{1}^T g \le 1/N\}. \tag{32}$$

Combining the above variational form for multi-class hinge loss and a VGF as penalty
on $X$, we can reformulate the nonsmooth convex optimization problem $\min_X \{\mathcal{L}(X) +
\lambda\Omega_{\mathcal{M}}(X)\}$ as the convex-concave saddle point problem

$$\min_X \ \max_{M \in \mathcal{M}\cap\mathbb{S}_+^m,\ g \in \mathcal{G}} \ \mathbf{1}^T g - \langle X, A\,\mathcal{E}(g)\rangle + \lambda \operatorname{tr}(X M X^T). \tag{33}$$

5.1. Mirror-prox algorithm with adaptive line search. The mirror-prox
(MP) algorithm was proposed by Nemirovski [35] for approximating the saddle points
of smooth convex-concave functions and solutions of variational inequalities with
Lipschitz continuous monotone operators. It is an extension of the extra-gradient method
[25], and more variants are studied in [23]. In this section, we first present a variant of
the MP algorithm equipped with an adaptive line search scheme. We then explain how
to apply it to solve the VGF-penalized loss minimization problem (29).

We describe the MP algorithm in the more general setup of solving variational
inequality problems. Let $\mathcal{Z}$ be a convex compact set in Euclidean space $\mathcal{E}$ equipped
with inner product $\langle\cdot,\cdot\rangle$, and $\|\cdot\|$ and $\|\cdot\|_*$ be a pair of dual norms on $\mathcal{E}$, i.e.,
$\|\xi\|_* = \max_{\|z\| \le 1} \langle\xi, z\rangle$. Let $F : \mathcal{Z} \to \mathcal{E}$ be a Lipschitz continuous monotone mapping:

$$\forall z, z' \in \mathcal{Z}: \quad \|F(z) - F(z')\|_* \le L\|z - z'\|, \quad \text{and} \quad \langle F(z) - F(z'), z - z'\rangle \ge 0. \tag{34}$$

The goal of the MP algorithm is to approximate a (strong) solution to the variational
inequality associated with $(\mathcal{Z}, F)$: $\langle F(z^*), z - z^*\rangle \ge 0$, $\forall z \in \mathcal{Z}$. Let $\phi(x, y)$ be a
smooth function that is convex in $x$ and concave in $y$, and $\mathcal{X}$ and $\mathcal{Y}$ be closed convex
sets. Then the convex-concave saddle point problem

$$\min_{x \in \mathcal{X}} \ \max_{y \in \mathcal{Y}} \ \phi(x, y),$$

can be posed as a variational inequality problem with $z = (x, y)$, $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ and

$$F(z) = \begin{bmatrix} \nabla_x \phi(x, y) \\ -\nabla_y \phi(x, y) \end{bmatrix}. \tag{35}$$

The setup of the mirror-prox algorithm requires a distance-generating function
$h(z)$ which is compatible with the norm $\|\cdot\|$. In other words, $h(z)$ is subdifferentiable
on the relative interior of $\mathcal{Z}$, denoted $\mathcal{Z}^\circ$, and is strongly convex with modulus 1 with
respect to $\|\cdot\|$, i.e., for all $z, z' \in \mathcal{Z}$, we have $\langle\nabla h(z) - \nabla h(z'), z - z'\rangle \ge \|z - z'\|^2$.
For any $z \in \mathcal{Z}^\circ$ and $z' \in \mathcal{Z}$, we can define the Bregman divergence at $z$ as

$$V_z(z') = h(z') - h(z) - \langle\nabla h(z), z' - z\rangle,$$

and the associated proximity mapping as

$$P_z(\xi) = \arg\min_{z' \in \mathcal{Z}} \{\langle\xi, z'\rangle + V_z(z')\} = \arg\min_{z' \in \mathcal{Z}} \{\langle\xi - \nabla h(z), z'\rangle + h(z')\}.$$

With these definitions, we are now ready to present the MP algorithm in Figure 2.
Compared with the original MP algorithm [35, 23], our variant employs an adaptive
line search procedure to determine the step sizes $\gamma_t$, for $t = 1, 2, \ldots$. We can exit the
algorithm whenever $V_{z_t}(z_{t+1}) \le \epsilon$ for some $\epsilon > 0$. Under the assumptions in (34),
the MP algorithm in Figure 2 enjoys the same $O(1/t)$ convergence rate as the one
proposed in [35], but performs much faster in practice. The proof requires only simple
modifications of the proof in [35, 23].


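Algorithm: Mirror-Prox($z_1$, $\gamma_1$, $\epsilon$)
repeat
    $t := t + 1$
    repeat
        $\gamma_t := \gamma_t / c_{\mathrm{dec}}$
        $w_t := P_{z_t}(\gamma_t F(z_t))$
        $z_{t+1} := P_{z_t}(\gamma_t F(w_t))$
    until $\delta_t \le 0$
    $\gamma_{t+1} := c_{\mathrm{inc}}\,\gamma_t$
until $V_{z_t}(z_{t+1}) \le \epsilon$
return $\bar{z}_t := \big(\sum_{\tau=1}^t \gamma_\tau\big)^{-1} \sum_{\tau=1}^t \gamma_\tau w_\tau$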


Fig. 2: Mirror-Prox algorithm with adaptive line search. Here $c_{\mathrm{dec}} > 1$ and $c_{\mathrm{inc}} > 1$ are
parameters controlling the decrease and increase of the step size $\gamma_t$ in the line search
trials. The stopping criterion for the line search is $\delta_t \le 0$ where $\delta_t = \gamma_t\langle F(w_t), w_t -
z_{t+1}\rangle - V_{z_t}(z_{t+1})$.
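
The procedure in Figure 2 can be sketched in a few lines for the Euclidean setup $h = \frac{1}{2}\|\cdot\|_2^2$, where $P_z(\xi)$ is the projection of $z - \xi$ onto $\mathcal{Z}$ and $V_z(z') = \frac{1}{2}\|z' - z\|_2^2$. This is an illustrative sketch rather than the authors' implementation: the function name `mirror_prox`, the iteration cap `max_iter`, the default parameter values, and the safeguard on very small step sizes are all assumptions added here. The caller supplies the operator `F` and the projection `project` on flattened variables.

```python
import numpy as np

def mirror_prox(F, project, z1, gamma1=1.0, eps=1e-10,
                c_dec=2.0, c_inc=3.0, max_iter=1000):
    """Euclidean mirror-prox with the adaptive line search of Figure 2.

    Returns the step-size-weighted average of the points w_t and the sum of
    the accepted step sizes.
    """
    z = np.asarray(z1, dtype=float)
    gamma = gamma1
    weighted_sum = np.zeros_like(z)
    gamma_sum = 0.0
    for _ in range(max_iter):
        while True:                                   # line search on gamma
            gamma /= c_dec
            w = project(z - gamma * F(z))             # w_t = P_z(gamma F(z_t))
            Fw = F(w)
            z_next = project(z - gamma * Fw)          # z_{t+1} = P_z(gamma F(w_t))
            V = 0.5 * np.sum((z_next - z) ** 2)       # Bregman divergence V_z(z_next)
            delta = gamma * np.dot(Fw, w - z_next) - V
            if delta <= 0.0 or gamma < 1e-12:         # second test is only a safeguard
                break
        weighted_sum += gamma * w
        gamma_sum += gamma
        z = z_next
        gamma *= c_inc                                # tentative increase for next round
        if V <= eps:
            break
    return weighted_sum / gamma_sum, gamma_sum
```

Choosing $c_{\mathrm{inc}} > c_{\mathrm{dec}}$ lets the first trial of each outer iteration use a larger step than the previously accepted one, so the step size can grow when the local Lipschitz behavior allows it. For a bilinear saddle problem $\min_x \max_y x^TAy$ over boxes, one would pass `F(z) = (A y, -A^T x)` and a clipping projection.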


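When $\hat{\mathcal{L}}$ is smooth and $\Omega_{\mathcal{M}} \equiv \Omega_{\mathcal{M}\cap\mathbb{S}_+^m}$, we can apply the MP algorithm to solve the
saddle-point problem in (29). Then, the gradient mapping in (35) becomes

$$F(X, M, g) = \begin{bmatrix} \operatorname{vec}\big(2\lambda X M + \mathcal{D}(g)\big) \\ -\lambda \operatorname{vec}(X^T X) \\ \operatorname{vec}\big(\nabla\hat{\mathcal{L}}(g) - \mathcal{D}^*(X)\big) \end{bmatrix} \tag{36}$$

where $\mathcal{D}^*(\cdot)$ is the adjoint operator to $\mathcal{D}(\cdot)$. Assuming $g \in \mathbb{R}^p$, computing $F$ requires
$O(nm^2 + nmp)$ operations for matrix multiplications. In Section 5.2, we present a
method that can potentially reduce the problem size by replacing $n$ with $\min\{mp, n\}$.
In the case of SVM with the hinge loss as in our real-data numerical example, one
can replace $n$ by $\min\{N, mp, n\}$, where $N$ is the number of samples.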

The assumption $\Omega_{\mathcal{M}} \equiv \Omega_{\mathcal{M}\cap\mathbb{S}_+^m}$ provides us with a convex-concave saddle point
optimization problem in (29). However, mirror-prox iterations for (29) require a
projection onto $\mathcal{M}\cap\mathbb{S}_+^m$ (or more generally, computation of the proximity mapping
$P_z(\xi)$ corresponding to the mirror map we use and a set $\mathcal{Z}$ defined via $\mathcal{M}\cap\mathbb{S}_+^m$), and
such projections might be much more complicated than projection onto $\mathcal{M}$. In fact,
while $\Omega_{\mathcal{M}} \equiv \Omega_{\mathcal{M}\cap\mathbb{S}_+^m}$ implies that the achieving matrix in $\sup_{M \in \mathcal{M}} \langle M, X^TX\rangle$ is always
in $\mathcal{M}\cap\mathbb{S}_+^m$, we need a separate guarantee to be able to project onto $\mathcal{M}$ and $\mathcal{M}\cap\mathbb{S}_+^m$
interchangeably. We remark on a guarantee for this in the following, where Lemma
12 and Corollary 13 provide sufficient conditions for when projection of a PSD matrix
onto $\mathcal{M}$ is equivalent to projection onto $\mathcal{M}\cap\mathbb{S}_+^m$.

Lemma 12. For any $G \succeq 0$, consider $P = \Pi_{\mathcal{M}}(G)$ and its Moreau decomposition
with respect to the positive semidefinite cone as $P = P_+ - P_-$ where $P_+, P_- \succeq 0$ and
$\langle P_+, P_-\rangle = 0$. Then, $P_+ \in \mathcal{M}$ implies $P_- = 0$.

Proof. Recall the firm nonexpansive property of the projection operator onto a
convex set [39] applied to $P = \Pi_{\mathcal{M}}(G)$ and $P_+ = \Pi_{\mathcal{M}}(P_+)$ (implied by $P_+ \in \mathcal{M}$). We
get $\|P - P_+\|_F^2 \le \langle P - P_+, G - P_+\rangle$ which implies $\langle P_-, G\rangle + \|P_-\|_F^2 \le 0$. Moreover,
for two PSD matrices $G$ and $P_-$ we have $\langle G, P_-\rangle \ge 0$. All in all, $P_- = 0$. □

Corollary 13. Provided that for any $M \in \mathcal{M}$ we have $M_+ \in \mathcal{M}$, then $\Omega_{\mathcal{M}}$ is
convex. Moreover, $\Pi_{\mathcal{M}}(G) \succeq 0$ for all $G \succeq 0$.

Corollary 13 establishes an important property about the iterates of the mirror-prox
algorithm with $h(\cdot) = \frac{1}{2}\|\cdot\|_2^2$ as the mirror map, corresponding to $P_z(\xi) = \Pi_{\mathcal{Z}}(z - \xi)$.
If in the algorithm of Figure 2 we initialize the part of $z_1$ corresponding to $M$ to be a PSD
matrix, all of such parts in the iterates $z_t$ and $w_t$ remain PSD as 1) we add a PSD
matrix ($\lambda X^T X$ from (36)) to the previous iterate, and, 2) the projection onto $\mathcal{M}$
(which is not necessarily a subset of the PSD cone) ends up being a PSD matrix (by
Corollary 13), hence it is equivalent to projection onto $\mathcal{M}\cap\mathbb{S}_+^m$. Notice that such a
condition is required for applying the mirror-prox algorithm: the objective has to
be convex-concave and the positive semidefiniteness of all iterates guarantees this
property.

The above provides a glimpse into a more general approach in optimization with
composite functions. While every convex function has a variational representation
in terms of its conjugate function, namely $\Omega_{\mathcal{M}}(X) = \sup_Y \langle X, Y\rangle - \Omega^*_{\mathcal{M}}(Y)$, such
expressions do not necessarily offer any computational advantage. With a more clever
exploitation of the structure, $\Omega_{\mathcal{M}}(X)$ can be seen as a composition of the support
function $S_{\mathcal{M}}(\cdot)$ with a structure mapping $g(X) = X^T X$, as in (15). Then,

$$\begin{aligned}
\min_X \ \mathcal{L}(X) + \Omega_{\mathcal{M}}(X) &= \min_X \ \sup_Y \ \mathcal{L}(X) + \langle g(X), Y\rangle - S^*_{\mathcal{M}}(Y) \\
&= \min_X \ \sup_{Y \in \mathcal{M}} \ \mathcal{L}(X) + \langle X^T X, Y\rangle
\end{aligned}$$

where we use the fact that $S^*_{\mathcal{M}}(Y)$ is the indicator function for the set $\mathcal{M}$. This can be
seen as an interpretation for how our proposed algorithm replaces proximal mapping
computations for $\Omega_{\mathcal{M}}$ with projections onto $\mathcal{M}$ (the proximal mapping of the indicator
function of $\mathcal{M}$). Of course, to be able to use convex optimization algorithms, we
will need to establish results similar to Lemma 12 and Corollary 13.

5.2. A Kernel Trick (Reduced Formulation). As we discussed earlier, when
the loss function has the structure (28), we can write the VGF-penalized minimization
problem as a convex-concave saddle point problem

$$J_{\mathrm{opt}} = \min_X \ \max_{g \in \mathcal{G}} \ \langle X, \mathcal{D}(g)\rangle - \hat{\mathcal{L}}(g) + \lambda\,\Omega(X). \tag{37}$$

Since $\mathcal{G}$ is compact, $\Omega$ is convex in $X$, and $\hat{\mathcal{L}}$ is convex in $g$, we can use the minimax
theorem to interchange the max and min. Then, for any orthogonal matrix $Q$ we have

$$\begin{aligned}
J_{\mathrm{opt}} &= \max_{g \in \mathcal{G}} \ \min_X \ \langle X, \mathcal{D}(g)\rangle - \hat{\mathcal{L}}(g) + \lambda\,\Omega(X) \\
&= \max_{g \in \mathcal{G}} \ \min_X \ \langle Q^T X, Q^T \mathcal{D}(g)\rangle - \hat{\mathcal{L}}(g) + \lambda\,\Omega(Q^T X) \\
&= \max_{g \in \mathcal{G}} \ \min_X \ \langle X, Q^T \mathcal{D}(g)\rangle - \hat{\mathcal{L}}(g) + \lambda\,\Omega(X)
\end{aligned} \tag{38}$$

where the second equality is due to the left unitary invariance of $\Omega$, and we renamed
the variable $X$ to get the third equality. Observe that $Q$ is an arbitrary orthogonal
matrix in (38) and can be chosen in a clever way to simplify $\mathcal{D}$ as described in the
sequel. Since $\mathcal{D}(g)$ is linear in $g$, consider a representation as


$$\mathcal{D}(g) = [D_1 g \ \cdots \ D_m g] = [D_1 \ \cdots \ D_m](I_m \otimes g) = D(I_m \otimes g), \tag{39}$$

for some $D_i \in \mathbb{R}^{n\times p}$ and $D \in \mathbb{R}^{n\times mp}$. Then, express $D$ as the product of an
orthogonal matrix and a residue matrix, such as in the QR decomposition $D = QR$, where
provided that $n > mp$, only the first $mp$ rows of $R$ can be nonzero (will be denoted
by $R_1$). Define $\mathcal{D}'(g) = R_1(I_m \otimes g) \in \mathbb{R}^{q\times m}$ for $q = \min\{mp, n\}$. Plugging the above
choice of $Q$ in (38) gives


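$$J_{\mathrm{opt}} = \max_{g \in \mathcal{G}} \ \min_{X_1, X_2} \ \Big\langle \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \begin{bmatrix} \mathcal{D}'(g) \\ 0 \end{bmatrix} \Big\rangle - \hat{\mathcal{L}}(g) + \lambda\,\Omega\Big(\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\Big).$$

Observe that setting $X_2$ to zero does not increase the value of $\Omega$, which allows for
restricting the above to the subspace $X_2 = 0$ and getting

$$J_{\mathrm{opt}} = \min_X \ \max_{g \in \mathcal{G}} \ \langle X, \mathcal{D}'(g)\rangle - \hat{\mathcal{L}}(g) + \lambda\,\Omega(X) \tag{40}$$

whose $X$ variable has $q = \min\{mp, n\}$ rows compared to $n$ rows in (6).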

Notice that while the evaluation of $J_{\mathrm{opt}}$ via (40) can potentially be more efficient,
we are interested in recovering an optimal point $X$ in (37) which is different from the
optimal points in (40). Tracing back the steps we took from (37) to (40), we get

$$X_{\mathrm{opt}}^{(37)} = Q \begin{bmatrix} X_{\mathrm{opt}}^{(40)} \\ 0 \end{bmatrix}.$$

The special case of regularization with squared Euclidean norm has been
understood and used before; e.g., see [41]. However, the above derivations show that we
can get similar results when the regularization can be represented as a maximum of
squared weighted Euclidean norms.
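
The reduction from (37) to (40) rests on two facts that can be checked numerically: $\mathcal{D}(g) = Q_1\,\mathcal{D}'(g)$ for the thin QR factor $Q_1$, and $X = Q_1 X'$ has the same Gram matrix (hence the same VGF value) and the same inner product with $\mathcal{D}(g)$ as the reduced pair. The helper names `apply_operator` and `reduce_operator` below are hypothetical, and the sketch assumes $n \ge mp$.

```python
import numpy as np

def apply_operator(D, g, m):
    """Evaluate D(g) = D (I_m kron g) for D of shape (rows, m*p) and g of length p."""
    return D @ np.kron(np.eye(m), g.reshape(-1, 1))

def reduce_operator(D):
    """Thin QR factorization D = Q1 R1; R1 plays the role of the reduced matrix."""
    Q1, R1 = np.linalg.qr(D, mode="reduced")
    return Q1, R1
```

After solving the reduced problem with `R1` in place of `D`, a solution of the original problem is recovered by multiplying the reduced solution by `Q1`, which corresponds to padding with zeros and applying $Q$.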

It is worth mentioning that the reduced formulation in (40) can be similarly
derived via a dual approach; one has to take the dual of the loss-regularized
optimization problem (e.g., see Example 11.41 in [39]), use the left unitary invariance of the
conjugate VGF to reduce $\mathcal{D}$ to $\mathcal{D}'$, and dualize the problem again, to get (40).

5.3. A Representer Theorem. A general loss-regularized optimization
problem as in (6) where the loss admits a Fenchel-type representation and the regularizer
is a strongly convex VGF (including all squared vector norms) enjoys a representer
theorem (see, e.g., [41]). More specifically, the optimal solution is linearly related
to the linear operator $\mathcal{D}$ in the representation of the loss. As mentioned before, for
many common loss functions, $\mathcal{D}$ encodes the samples, which reduces the following
proposition to the usual representer theorem.

Proposition 14. For a loss-regularized minimization problem as in (6) where
$\mathcal{M} \subseteq \mathbb{S}_{++}^m$ and $\mathcal{L}$ admits a Fenchel-type representation as

$$\mathcal{L}(X) = \max_{g \in \mathcal{G}} \ \langle X, \mathcal{D}(g)\rangle - \hat{\mathcal{L}}(g) = \max_{g \in \mathcal{G}} \ \langle X, D(I_m \otimes g)\rangle - \hat{\mathcal{L}}(g),$$

the optimal solution $X_{\mathrm{opt}}$ admits a representation of the form

$$X_{\mathrm{opt}} = DC$$

with a coefficient matrix $C$ given by $C = -\frac{1}{2\lambda} M_{\mathrm{opt}}^{-1} \otimes g_{\mathrm{opt}}$ (optimal solutions of (29)).

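Proof. Denote the optimal solution of (29) by $(X_{\mathrm{opt}}, g_{\mathrm{opt}}, M_{\mathrm{opt}})$, which shares
$(X_{\mathrm{opt}}, g_{\mathrm{opt}})$ with (37). Consider the optimality condition as $-\frac{1}{\lambda}\mathcal{D}(g_{\mathrm{opt}}) \in \partial\,\Omega(X_{\mathrm{opt}})$
which implies $X_{\mathrm{opt}} \in \partial\,\Omega^*(-\frac{1}{\lambda}\mathcal{D}(g_{\mathrm{opt}}))$. Now, suppose $\mathcal{M} \subseteq \mathbb{S}_{++}^m$ which implies $\Omega_{\mathcal{M}}$
is strongly convex. Considering the characterization of the subdifferential for $\Omega^*$ from
Proposition 7 as well as the representation of $\mathcal{D}(g)$ in (39) we get

$$X_{\mathrm{opt}} = -\tfrac{1}{2\lambda}\,\mathcal{D}(g_{\mathrm{opt}})\,M_{\mathrm{opt}}^{-1} = -\tfrac{1}{2\lambda}\,D(I_m \otimes g_{\mathrm{opt}})\,M_{\mathrm{opt}}^{-1} = -\tfrac{1}{2\lambda}\,D(M_{\mathrm{opt}}^{-1} \otimes g_{\mathrm{opt}}). \qquad □$$

This representer theorem allows us to apply our methods in more general
reproducing kernel Hilbert spaces (RKHS) by choosing a problem specific reproducing
kernel; e.g., see [41, 44].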
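
The formula $X_{\mathrm{opt}} = DC$ with $C = -\frac{1}{2\lambda} M^{-1} \otimes g$ can be sanity-checked in isolation: for a fixed positive definite $M$ and a fixed $g$ (both held fixed here purely for illustration, rather than optimized as in (29)), the matrix $DC$ should zero the gradient $\mathcal{D}(g) + 2\lambda XM$ of $\langle X, \mathcal{D}(g)\rangle + \lambda\operatorname{tr}(XMX^T)$. The helper name `representer_solution` is hypothetical.

```python
import numpy as np

def representer_solution(D, g, M, lam):
    """Return (X, C) with X = D C and C = -(1/(2 lam)) * kron(M^{-1}, g).

    D has shape (n, m*p), g has length p, and M is an m x m positive definite matrix.
    """
    Minv = np.linalg.inv(M)
    C = -np.kron(Minv, g.reshape(-1, 1)) / (2.0 * lam)   # shape (m*p, m)
    return D @ C, C
```

The identity $(I_m \otimes g)M^{-1} = M^{-1} \otimes g$ is what allows the coefficient matrix to be written as a single Kronecker product.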

6. Numerical Example. In this section, we discuss the application of VGFs
in hierarchical classification to demonstrate the effectiveness of the presented
algorithms in a real data experiment. More specifically, we compare the modified
mirror-prox algorithm with adaptive line search presented in Section 5.1 with the variant of
Regularized Dual Averaging (RDA) method used in [44] in the text categorization
application discussed in [44].

Let $(a_1, b_1), \ldots, (a_N, b_N)$ be a set of labeled data where each $a_i \in \mathbb{R}^n$ is a feature
vector and the associated $b_i \in \{1, \ldots, m\}$ is a class label. The goal of multi-class
classification is to learn a classification function $f : \mathbb{R}^n \to \{1, \ldots, m\}$ so that, given
any sample $a \in \mathbb{R}^n$ (not necessarily in the training set), the prediction $f(a)$ attains a
small classification error compared with the true label.

In hierarchical classification, the class labels $\{1, \ldots, m\}$ are organized in a category
tree, where the root of the tree is given the fictitious label 0 (see Figure 3a). For
each node $i \in \{0, 1, \ldots, m\}$, let $\mathcal{C}(i)$ be the set of children of $i$, $\mathcal{S}(i)$ be the set of
siblings of $i$, and $\mathcal{A}(i)$ be the set of ancestors of $i$ excluding 0 but including itself. A
hierarchical linear classifier $f(a)$ is defined in Figure 3b, which is parameterized by
the vectors $x_1, \ldots, x_m$ through a recursive procedure. In other words, an instance is
labeled sequentially by choosing the category for which the associated vector outputs
the largest score among its siblings, until a leaf node is reached. An example of this
recursive procedure is shown in Figure 3a. For the hierarchical classifier defined above,
given an example with label $b_s$, a correct prediction made by $f(a)$ implies that (30)
holds with

$$\mathcal{I}(k) = \{(i, j) :\ j \in \mathcal{S}(i),\ i \in \mathcal{A}(k)\}.$$

Given a set of examples $(a_1, b_1), \ldots, (a_N, b_N)$, we can train a hierarchical classifier
parametrized by $X = [x_1, \ldots, x_m]$ by solving the problem $\min_X \{\mathcal{L}(X) + \lambda\Omega(X)\}$, with
the loss function $\mathcal{L}(X)$ defined in (31) and an appropriate VGF penalty function $\Omega(X)$.
As discussed in Section 5, the training optimization problem can be reformulated as a
convex-concave saddle point problem of the form (29) and solved by the mirror-prox
algorithm described in Section 5.1. In addition, we can use the reduction procedure
discussed in Section 5.2 to reduce computational cost.

Fig. 3: (3a): An example of hierarchical classification with four class labels $\{1, 2, 3, 4\}$.
The instance $a$ is classified recursively until it reaches the leaf node $b = 3$, which is
its predicted label. (3b): Definition of the hierarchical classification function.

As discussed in [44], one can assume a model where classification at different
levels of the hierarchy relies on different features or different combinations of features.
Therefore, the authors of [44] proposed regularization with $|x_i^T x_j|$ whenever $j \in \mathcal{A}(i)$. A
convex formulation of such a regularization function can be given in the form (4) with

$$\mathcal{M} = \{M :\ M_{ii} = \bar{M}_{ii},\ |M_{ij}| \le \bar{M}_{ij}\}$$

where the nonzero pattern of $\bar{M}$ corresponds to the pairs of ancestor-descendant nodes.
According to (17), we have $\mathcal{M} \subseteq \mathbb{S}_+^m$ provided that $\lambda_{\min}(\widetilde{M}) \ge 0$; see Figure 1.

As a real-world example, we consider the classification dataset Reuters Corpus
Volume I, RCV1-v2 [28], which is an archive of over 800,000 manually categorized
newswire stories and is available in libSVM. A subset of the hierarchy of labels in
RCV1-v2, with $m = 23$ labels (18 leaves), is called ECAT and is used in our
experiments. The samples and the classifiers are of dimension $n = 47236$. Lastly, there are
2196 training samples and 69160 test samples available.

We solve the same loss-regularized problem as in [44], but using mirror-prox
(discussed in Section 5.1) instead of regularized dual averaging (RDA). The regularization
function is a VGF and is given in (4). A reformulation of the whole problem as a
smooth convex-concave problem is given in (33). To obtain comparable results, we
use the same matrix $\bar{M}$ and regularization parameter $\lambda = 1$ as in [44]. Note that in
this experiment, $n = 47236$ while $m = 23$ and $p > 2196$, so the kernel trick is not
particularly useful since $n$ is not larger than $mp$.

Since we are solving the same problem as [44], the prediction error on test data 
will be the same as the error reported in this reference, which is better than the 
other methods. Moreover, one can look at the estimated classifiers and how well 
they validate the orthogonality assumption. Figure 4 compares the pairwise inner 
products of classifiers estimated by our approach for hierarchical classification and 
those estimated by “transfer” method (see [44] for details on this method). 

Fig. 4: Pairwise angles (in degrees) between the estimated classifiers for dataset
MCAT (part of RCV1-v2 [28]) via (left) regularization by the VGF in (4) and (right)
the “transfer” method (see [44] and references therein). The boldface entries in red
correspond to ancestor-descendant relations in the hierarchy of MCAT labels.

In the setup of the mirror-prox algorithm, we use $\frac{1}{2}\|\cdot\|_2^2$ as the
mirror map, which requires the least knowledge about the optimization problem (see
[23] for the requirements when combining a number of mirror maps corresponding to
different constraint sets in the saddle point optimization problem). With this mirror
map, the steps of mirror-prox only require orthogonal projection onto Q and M. The
projection onto Q in (32) boils down to separate projections onto N scaled simplexes
(where the summation of entries is bounded by 1 and not necessarily equal to 1). Each
projection amounts to zeroing out the negative entries followed by a projection onto
the $\ell_1$ unit norm ball (e.g., using the simple process described in [15]).
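
The two-step projection described above can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `project_scaled_simplex` is hypothetical, and the $\ell_1$-ball step uses a standard sort-based threshold search, which is an assumption about the kind of process meant by the reference to [15].

```python
import numpy as np

def project_scaled_simplex(v, radius=1.0):
    """Euclidean projection onto {x : x >= 0, sum(x) <= radius}.

    Hypothetical helper: zero out negatives, then project onto the l1 ball.
    """
    v = np.asarray(v, dtype=float)
    w = np.maximum(v, 0.0)                  # step 1: zero out the negative entries
    if w.sum() <= radius:
        return w                            # already inside the l1 ball
    # Step 2: project the nonnegative vector onto the l1 ball of the given
    # radius, i.e. find tau > 0 with sum(max(w - tau, 0)) == radius.
    u = np.sort(w)[::-1]                    # entries in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * ks > css - radius)[0][-1]
    tau = (css[rho] - radius) / (rho + 1)
    return np.maximum(w - tau, 0.0)

p_inside = project_scaled_simplex([0.5, -1.0, 0.2])   # sum of positives <= 1
p_outside = project_scaled_simplex([0.8, 0.6, -3.0])  # needs thresholding
```

In the first call only the negative entry is removed; in the second, the positive entries are shifted down by a common threshold so that they sum to 1.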

The variant of RDA proposed in [44] has a convergence rate of $O(\ln(t)/(\sigma t))$
for the objective value, where $\sigma$ is the strong convexity parameter of the
objective. On the other hand, mirror-prox enjoys a convergence rate of $O(1/t)$ as
given in [35]. Although there is a clear advantage to the MP method compared to RDA
in terms of the theoretical guarantee, one should be aware of the difference between
the notions of gap for the two methods. Figure 5a compares
$\|X_t - X_{\mathrm{final}}\|_F$ for MP and RDA using each one’s own final estimate
$X_{\mathrm{final}}$. In terms of the runtime, we empirically observe that each
iteration of MP takes about 3 times longer than an iteration of RDA. However, as
evident from Figure 5a, MP is still much faster in generating a fixed-accuracy
solution. Figure 5b illustrates the decay in the value of the gap for the mirror-prox
method, $V_{z_t}(z_{t+1})$, which confirms the theoretical convergence rate of
$O(1/t)$.
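
To make the structure of the iteration concrete, here is a minimal sketch of mirror-prox with the Euclidean mirror map, in which case each step reduces to an orthogonal projection (the classical extragradient scheme). Everything in it is an illustrative assumption: the toy bilinear saddle-point problem, the box constraint, the step size, and the names `extragradient`, `F`, and `project_box` are not from the experiment above. The recorded quantity $\frac{1}{2}\|z_{t+1} - z_t\|_2^2$ is the Bregman divergence of the Euclidean mirror map, assuming that is what $V_{z_t}(z_{t+1})$ denotes.

```python
import numpy as np

def extragradient(F, project, z0, step, iters):
    """Mirror-prox with the Euclidean mirror map (extragradient sketch)."""
    z = np.asarray(z0, dtype=float)
    avg = np.zeros_like(z)
    gaps = []
    for _ in range(iters):
        w = project(z - step * F(z))         # extrapolation step
        z_next = project(z - step * F(w))    # update step, operator evaluated at w
        gaps.append(0.5 * np.sum((z_next - z) ** 2))  # Euclidean Bregman divergence
        avg += w
        z = z_next
    return z, avg / iters, gaps

# Toy problem (assumption): min over x in [-1, 1], max over y in [-1, 1] of x * y.
def F(z):
    x, y = z
    return np.array([y, -x])                 # (gradient in x, minus gradient in y)

def project_box(z):
    return np.clip(z, -1.0, 1.0)             # orthogonal projection onto the box

# The operator F is 1-Lipschitz, so step = 0.5 satisfies step <= 1/L.
z_last, z_avg, gaps = extragradient(F, project_box, [0.5, 0.5], step=0.5, iters=100)
```

On this toy problem the iterates approach the saddle point at the origin and the recorded gap values decrease along the run.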




Fig. 5: Convergence behavior for mirror-prox and RDA in our numerical experiment.
(a) Average error over the m classifiers between each iteration and the final estimate,
$\|X_t - X_{\mathrm{final}}\|_F$. (b) MP’s gap $V_{z_t}(z_{t+1})$. (c) The value of
the loss function relative to the final value. For visualization purposes, all of the
plots show data points at every 10 iterations. All vertical axes have a logarithmic
scale.
7. Discussion. In this paper, we introduce variational Gram functions, which
include many existing regularization functions as well as important new ones.
Convexity properties of this class, conjugate functions, subdifferentials, semidefinite
representability, proximal operators, and other convex analysis properties are studied.
By exploiting the structure in loss and the regularizer, namely
$\mathcal{L}(X) = \hat{\mathcal{L}}^*(\mathcal{D}^*(X))$ and the variational form (1)
of the VGF, we provide various tools and insight into such regularized
loss minimization problems: By adapting the mirror-prox method [35], we provide
a general and efficient optimization algorithm for VGF-regularized loss minimization 
problems. We establish a general kernel trick and a representer theorem for such 
problems. Finally, the effectiveness of VGF regularization as well as the efficiency 
of our optimization approach is illustrated by a numerical example on hierarchical 
classification for text categorization. 

There are numerous directions for future research on this class of functions. One 
issue to address is how to systematically pick an appropriate set M when defining a 
new VGF for some new application. Statistical properties of VGFs, for example the 
corresponding sample complexity, are of interest from a learning theory perspective. 
The presented kernel trick (which uses the left unitary invariance property of VGFs)
can be potentially extended to other invariant regularizers. And last but not least, it 
is interesting to see if there is a variational Gram representation for any squared left 
unitarily invariant norm. 
Appendix A. Additional Proofs.

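Proof of Lemma J^. First, assume that $\Omega$ is convex. By plugging in $X$ and $-X$
in the definition of convexity for $\Omega$ we get $\Omega(X) \ge 0$, so the square
root is well-defined. We show the triangle inequality
$\sqrt{\Omega(X+Y)} \le \sqrt{\Omega(X)} + \sqrt{\Omega(Y)}$ holds for any $X, Y$.
If $\Omega(X+Y)$ is zero, the inequality is trivial. Otherwise, for any
$\theta \in (0,1)$ let $A = \frac{1}{\theta}X$, $B = \frac{1}{1-\theta}Y$, and use
the convexity and second-order homogeneity of $\Omega$ to get

(41) $\Omega(X+Y) = \Omega(\theta A + (1-\theta)B) \le \theta\,\Omega(A) + (1-\theta)\,\Omega(B) = \frac{1}{\theta}\Omega(X) + \frac{1}{1-\theta}\Omega(Y)$.

If $\Omega(X) > \Omega(Y) = 0$, set
$\theta = (\Omega(X) + \Omega(X+Y))/(2\,\Omega(X+Y)) > 0$. Notice that $\theta \ge 1$
provides $\Omega(X) \ge \Omega(X+Y)$ as desired. On the other hand, if $\theta < 1$,
we can use it in (41) to get the desired result as

$\Omega(X+Y) \le \frac{1}{\theta}\Omega(X) = \frac{2\,\Omega(X+Y)\,\Omega(X)}{\Omega(X) + \Omega(X+Y)} \implies \Omega(X) \ge \Omega(X+Y)$.

And if $\Omega(X), \Omega(Y) \neq 0$, set
$\theta = \sqrt{\Omega(X)}/(\sqrt{\Omega(X)} + \sqrt{\Omega(Y)}) \in (0,1)$ to get

$\Omega(X+Y) \le \frac{1}{\theta}\Omega(X) + \frac{1}{1-\theta}\Omega(Y) = (\sqrt{\Omega(X)} + \sqrt{\Omega(Y)})^2$.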
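
The algebra behind this last choice of $\theta$ can be spelled out as a short worked step. The shorthand $a$ and $b$ is introduced here only for illustration and does not appear in the original proof; $\Omega$ denotes the function in the lemma.

```latex
% Shorthand (assumption for this note): a = \sqrt{\Omega(X)} > 0, b = \sqrt{\Omega(Y)} > 0.
\[
\theta = \frac{a}{a+b}, \qquad 1-\theta = \frac{b}{a+b},
\]
\[
\frac{1}{\theta}\,a^2 + \frac{1}{1-\theta}\,b^2
  = (a+b)\,a + (a+b)\,b
  = (a+b)^2 .
\]
% This \theta is also the minimizer of the right-hand side of (41) over (0,1):
\[
\frac{d}{d\theta}\left(\frac{a^2}{\theta} + \frac{b^2}{1-\theta}\right)
  = -\frac{a^2}{\theta^2} + \frac{b^2}{(1-\theta)^2} = 0
  \iff \frac{1-\theta}{\theta} = \frac{b}{a}
  \iff \theta = \frac{a}{a+b}.
\]
```

Taking square roots on both sides of the resulting bound gives the triangle inequality for the square root of $\Omega$.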


Since $\sqrt{\Omega}$ satisfies the triangle inequality and absolute homogeneity, it
is a semi-norm. Notice that $\Omega(X) = 0$ does not necessarily imply $X = 0$, unless
$\Omega$ is strictly convex.

Now, suppose that $\sqrt{\Omega}$ is a semi-norm; hence convex. The function $f$
defined by $f(x) = x^2$ for $x \ge 0$ and $f(x) = 0$ for $x \le 0$ is convex and
non-decreasing, so the composition of these two functions is convex and equal to
$\Omega$. It is worth mentioning that one can alternatively use Corollary 15.3.2 of
[37] to prove the first part of the lemma. □

Lemma 15. Consider any norm $\|\cdot\|$. Then, $\|\,|\cdot|\,\|$ is a norm itself if
and only if we have $\|\,|x|\,\| = \min_{y \ge |x|} \|y\|$.

Proof of Lemma 15. First, suppose $\|\cdot\|_a := \|\,|\cdot|\,\|$ is a norm; hence it
is an absolute norm and is monotonic as well by definition. Therefore, for any
$y \ge |x|$ we have $\|y\|_a \ge \|x\|_a$, which gives
$\min_{y \ge |x|} \|y\|_a \ge \|x\|_a$. Since $|x|$ is feasible in this optimization,
and $\|\,|x|\,\|_a = \|x\|_a$, we get the desired result:
$\|\,|x|\,\| = \|x\|_a = \min_{y \ge |x|} \|y\|$.

On the other hand, consider $f(x) := \min_{y \ge |x|} \|y\|$. We show that it is a
norm. Clearly, $f$ is nonnegative and homogeneous, and $f(x) = 0$ implies that
$\|y\| = 0$ for some $y \ge |x| \ge 0$, which implies $x = 0$. The triangle inequality
can be verified as

$f(x+z) = \min_{y \ge |x+z|} \|y\| \le \min_{y \ge |x|+|z|} \|y\| = \min_{y_1 \ge |x|,\ y_2 \ge |z|} \|y_1 + y_2\| \le \min_{y_1 \ge |x|,\ y_2 \ge |z|} \|y_1\| + \|y_2\| = f(x) + f(z)$.

□











